Author: Alexander Barkov
Email: 
Message:
> Thanks for your quick answer.
> 
> I tried to add the NoIndexIf but i cannot get it to work.
> 
> I used the indexer.conf default file, and added the two following lines at 
> the end of that file : 
> Server http://www.wearethelous.com/feed/
> NoIndexIf Content-Type application/rss+xml

I tried the same thing, and it seems to work fine.
This page is not returned in search results.

If I remove the NoIndexIf command, this page IS returned by search results.


Note that indexer still shows the URL in its log, because it must
download the URL to learn its content type.
But the "SectionFilter:..." line in the log shows that indexer marks
the document as "not for indexing" and therefore stores no data in the
underlying tables cachedcopy and bdicti, so a later "indexer --index" does
not see it when creating the search index.
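
For reference, the setup under discussion can be sketched as the following
indexer.conf fragment. It is only the two lines quoted above, appended to the
default indexer.conf; the comments are annotations added here, and the rest of
the default file is assumed to be unchanged.

```
# Crawl the feed URL. indexer still has to download it,
# because the Content-Type is only known from the HTTP response.
Server http://www.wearethelous.com/feed/

# Mark documents served with this Content-Type as "not for indexing".
# A match is reported in the indexer log as a "SectionFilter: NoIndexIf ..." line.
NoIndexIf Content-Type application/rss+xml
```

With this fragment in place, the URL is still expected to appear in the crawl
log, but the page should not come back in search results.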

> 
> I got the following log : 
> 
> [71598]{--} Clearing
> [71598]{--} Clearing done       0.01
> [71600]{--} indexer from mnogosearch-3.4.1-mysql-pqsql started with '/etc/mnogosearch/indexer.conf'
> [71600]{01} URL: http://www.wearethelous.com/feed/
> [71600]{01} Server Path Allow 'http://www.wearethelous.com/feed/'
> [71600]{01} Allow by default
> [71600]{01} ROBOTS: http://www.wearethelous.com/robots.txt
> [71600]{01} Request.Accept-Encoding: gzip,deflate,compress
> [71600]{01} Request.Host: www.wearethelous.com
> [71600]{01} Request.User-Agent: MnoGoSearch/3.4.1
> [71600]{01} Response.Connection: close
> [71600]{01} Response.Content-Encoding: gzip
> [71600]{01} Response.Content-Length: 67
> [71600]{01} Response.Content-Type: text/plain
> [71600]{01} Response.Date: Wed, 12 Oct 2016 20:42:46 GMT
> [71600]{01} Response.Link: <http://www.wearethelous.com/wp-json/>; rel="https://api.w.org/";
> [71600]{01} Response.ResponseLine: HTTP/1.1 200 OK
> [71600]{01} Response.ResponseSize: 475
> [71600]{01} Response.ResponseTime: 2261
> [71600]{01} Response.Server: Apache/2.2.31 (Unix) mod_ssl/2.2.31 OpenSSL/1.0.1e-fips mod_bwlimited/1.4
> [71600]{01} Response.Server-Charset: utf-8
> [71600]{01} Response.Status: 200
> [71600]{01} Response.URL: http://www.wearethelous.com/robots.txt
> [71600]{01} Response.URL_ID: 1928115922
> [71600]{01} Response.Vary: Accept-Encoding,User-Agent
> [71600]{01} Response.X-Powered-By: PHP/5.5.29
> [71600]{01} Response.X-Robots-Tag: noindex, follow
> [71600]{01} Request.Accept-Encoding: gzip,deflate,compress
> [71600]{01} Request.Host: www.wearethelous.com
> [71600]{01} Request.User-Agent: MnoGoSearch/3.4.1
> [71600]{01} Response.body: 
> [71600]{01} Response.Charset: 
> [71600]{01} Response.Connection: close
> [71600]{01} Response.Content-Encoding: gzip
> [71600]{01} Response.Content-Language: 
> [71600]{01} Response.Content-Length: 2337
> [71600]{01} Response.Content-Type: application/rss+xml
> [71600]{01} Response.crc32: 0
> [71600]{01} Response.crc32old: 0
> [71600]{01} Response.Date: Wed, 12 Oct 2016 20:42:48 GMT
> [71600]{01} Response.ETag: "7059155a990290887650add31475f88e"
> [71600]{01} Response.Hops: 0
> [71600]{01} Response.ID: 5
> [71600]{01} Response.ilinktext: 
> [71600]{01} Response.Last-Modified: Thu, 29 Sep 2016 12:48:50 GMT
> [71600]{01} Response.Link: <http://www.wearethelous.com/wp-json/>; rel="https://api.w.org/";
> [71600]{01} Response.MaxDocPerSite: 0
> [71600]{01} Response.MaxHops: 256
> [71600]{01} Response.meta.description: 
> [71600]{01} Response.meta.keywords: 
> [71600]{01} Response.msg.from: 
> [71600]{01} Response.msg.subject: 
> [71600]{01} Response.msg.to: 
> [71600]{01} Response.PrevStatus: 0
> [71600]{01} Response.ResponseLine: HTTP/1.1 200 OK
> [71600]{01} Response.ResponseSize: 2842
> [71600]{01} Response.ResponseTime: 1455
> [71600]{01} Response.Server: Apache/2.2.31 (Unix) mod_ssl/2.2.31 OpenSSL/1.0.1e-fips mod_bwlimited/1.4
> [71600]{01} Response.Server-Charset: utf-8
> [71600]{01} Response.Server_id: -2050898686
> [71600]{01} Response.Status: 200
> [71600]{01} Response.title: 
> [71600]{01} Response.URL: http://www.wearethelous.com/feed/
> [71600]{01} Response.url.file: 
> [71600]{01} Response.url.host: 
> [71600]{01} Response.url.path: 
> [71600]{01} Response.url.proto: 
> [71600]{01} Response.URL_ID: -2050898686
> [71600]{01} Response.Vary: Accept-Encoding,User-Agent
> [71600]{01} Response.X-Powered-By: PHP/5.5.29
> [71600]{01} Response.X-Robots-Tag: noindex, follow
> [71600]{01} Status: 200 OK
> [71600]{01} Guesser: Lang: , Charset: utf-8
> [71600]{01} SectionFilter: NoIndexIf Match Wild Insensitive 'Content-Type' 'application/rss+xml'
> [71600]{01} Flushing word cache
> [71600]{01} Flushing word cache done    0.00
> [71600]{01} Done (4 seconds, 1 documents, 2842 bytes,  0.69 Kbytes/sec.)
> 
> I see that the section filter talks about the NoIndexIf filter that i added, 
> but the url is still indexed.
> So what can be wrong ?
> 
> Thanks in advance for your help.
> Fabien.
> 
> 
> > Hi,
> > 
> > > Hi all,
> > > 
> > > Is it possible to exclude certain mime types such as rss feeds ?
> > > 
> > 
> > This can be done using the NoIndexIf command:
> > 
> > http://www.mnogosearch.org/doc34/msearch-cmdref-noindexif.html
> > 
> > Put this command into indexer.conf to disallow a certain Content-Type:
> > 
> > NoIndexIf Content-Type application/rss+xml
> > 
> > 
> > Another option is to use NoIndexIf in a combination with a user defined 
> > section, to check raw content fragments:
> > 
> > http://www.mnogosearch.org/doc34/msearch-cmdref-section.html#cmdref-section-user-defined
> > 
> > The idea is to define a user section using a regex pattern to catch some 
> > known RSS text fragments, and then use NoIndexIf with this section.
> > 
> > 
> > > Thanks in advance,
> > > Fabien.
> > 

Reply: <http://www.mnogosearch.org/board/message.php?id=21792>

_______________________________________________
General mailing list
General@mnogosearch.org
http://lists.mnogosearch.org/listinfo/general
