[ 
https://issues.apache.org/jira/browse/NUTCH-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992502#comment-12992502
 ] 

Josh Pavel commented on NUTCH-966:
----------------------------------

A plugin that corrects the issue (again, thanks to Julien Nioche):

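import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

public class MetaNoIndexingFilter implements IndexingFilter {
    public static final Log LOG = LogFactory.getLog(MetaNoIndexingFilter.class);

    private Configuration conf;

    public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
            CrawlDatum datum, Inlinks inlinks) throws IndexingException {
        // Ideally this would rely on doc or parse metadata, but nothing is
        // stored there by the html parser, so check for empty text and
        // title instead.
        String text = parse.getText();
        String title = parse.getData().getTitle();
        if ((text == null || text.equals(""))
                && (title == null || title.equals(""))) {
            // no text and no title -> no indexing; returning null drops
            // the document
            return null;
        }
        return doc;
    }

    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    public Configuration getConf() {
        return this.conf;
    }

}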
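
For anyone who wants to check the rule without a Nutch build, here is a minimal standalone sketch of just the decision the filter makes. The class name NoIndexCheck and the helpers isBlank and shouldIndex are made up for illustration and are not part of Nutch; the only thing taken from the plugin is the condition itself (drop the document only when both text and title are null or empty).

```java
public class NoIndexCheck {

    // Hypothetical helper: true when the string is null or has no characters.
    static boolean isBlank(String s) {
        return s == null || s.isEmpty();
    }

    // Mirrors the plugin's condition: a document is skipped only when
    // BOTH the text and the title are missing; otherwise it is indexed.
    static boolean shouldIndex(String text, String title) {
        return !(isBlank(text) && isBlank(title));
    }

    public static void main(String[] args) {
        System.out.println(shouldIndex("", ""));       // both empty
        System.out.println(shouldIndex(null, null));   // both null
        System.out.println(shouldIndex("body", ""));   // text only
        System.out.println(shouldIndex("", "Title"));  // title only
    }
}
```

Note that a page with a title but no text (or the reverse) is still indexed under this rule, same as in the plugin above.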

> Behavior of NOINDEX,FOLLOW is not intuitive
> -------------------------------------------
>
>                 Key: NUTCH-966
>                 URL: https://issues.apache.org/jira/browse/NUTCH-966
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>    Affects Versions: 1.2
>            Reporter: Josh Pavel
>            Priority: Minor
>
> If a page has NOINDEX,FOLLOW for the ROBOTS metatag, Nutch will still create
> a document that can be found in the index via metatag or URL matching.
> Instead, Nutch should not index such a page at all. Ideally the indexer would
> rely on doc or parse metadata, but nothing is stored there by the html
> parser. (Thanks to Julien Nioche for helping me to understand the issue.)

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
