Sebastian Nagel resolved NUTCH-2387.
    Resolution: Cannot Reproduce

Hi [~eyeris], I've verified that the linked HTML page gets deleted if 
indexer.delete.robots.noindex is true when the indexer is called.
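
For reference, a minimal sketch of how the property could be enabled. It 
assumes the setting goes into conf/nutch-site.xml of the installation that 
runs the indexer (the file location is an assumption about your setup). The 
quoted code below reads the property with a default of false, so it has to be 
set explicitly:

```xml
<!-- conf/nutch-site.xml (assumed location): override the default of false -->
<property>
  <name>indexer.delete.robots.noindex</name>
  <value>true</value>
</property>
```

The value must be visible to the indexing job itself, so it is worth checking 
that the indexer picks up the same configuration directory that was edited.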

If the problem persists, please reopen the issue with details on how the 
indexer is called.

> Nutch should not index document with "noindex" meta
> ---------------------------------------------------
>                 Key: NUTCH-2387
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2387
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.13
>         Environment: Linux Mint 18
>            Reporter: Eyeris Rodriguez Rueda
>            Priority: Major
>              Labels: index, meta, robots,
>             Fix For: 1.16
> I'm using Nutch 1.12 in local mode and Solr 4.10.3.
> For some reason, I have found that Nutch indexes documents with a "noindex" 
> robots meta tag.
> I used the Nutch crawl script for a complete cycle: 
> bin/crawl -i urls/ crawl/ -2
> with this URL:
> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/ 
> After repeated testing the problem persists, and approximately 200 documents 
> with this robots meta tag are indexed incorrectly.
> I have read the configure method in the IndexerMapReduce.java class, and it 
> has a line for that property, but for some reason it is not working properly:
> this.deleteRobotsNoIndex = job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX, false);   (line 97)

This message was sent by Atlassian Jira
