[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14595385#comment-14595385
]
Chris A. Mattmann commented on NUTCH-2038:
------------------------------------------
Hey Seb:
Well, the native URLFilter interface doesn't allow this. I was thinking of
having an AbstractURLFilter class that implements public boolean accept(String
url), to the effect of:
{noformat}
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.parse.ParseData;

public abstract class AbstractURLFilter {
  public AbstractURLFilter() {}

  public void setConf(Configuration conf) {
    // initialize UrlDb reader
    // initialize Tika
    // other stuff
  }

  public boolean accept(String url) {
    // see if it's been crawled already
    CrawlDatum datum = this.findDatumForUrl(url);
    // call Tika or look it up
    ParseData pData = this.findParseDataForUrl(url);
    return acceptUrl(url, datum, pData /* , other stuff */);
  }

  public abstract boolean acceptUrl(String url, CrawlDatum datum, ParseData pData
      /* , etc. */);
}
{noformat}
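To make the shape of that idea concrete, here is a minimal, self-contained sketch of the same template-method pattern without any Nutch dependencies. Everything in it is a stand-in I made up for illustration: PageInfo replaces CrawlDatum/ParseData, the in-memory Map replaces the UrlDb reader and Tika lookup, and HotWordFilter uses a toy hot-word rule rather than the Naive Bayes classifier.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the per-URL data (CrawlDatum / ParseData in Nutch).
final class PageInfo {
    final boolean alreadyCrawled;
    final String text;

    PageInfo(boolean alreadyCrawled, String text) {
        this.alreadyCrawled = alreadyCrawled;
        this.text = text;
    }
}

// Base class: looks up what is known about the URL, then delegates the decision.
abstract class LookupUrlFilter {
    // Assumption: an in-memory map stands in for the UrlDb reader / parse lookup.
    private final Map<String, PageInfo> store;

    LookupUrlFilter(Map<String, PageInfo> store) {
        this.store = store;
    }

    // Template method: subclasses never repeat the lookup logic.
    public final boolean accept(String url) {
        PageInfo info = store.get(url); // null when the URL is unknown
        return acceptUrl(url, info);
    }

    protected abstract boolean acceptUrl(String url, PageInfo info);
}

// Toy subclass: accepts a URL if the URL itself, or the text of its page
// (when known), contains one of the hot words. Hot words are expected lowercase.
final class HotWordFilter extends LookupUrlFilter {
    private final List<String> hotWords;

    HotWordFilter(Map<String, PageInfo> store, List<String> hotWords) {
        super(store);
        this.hotWords = hotWords;
    }

    @Override
    protected boolean acceptUrl(String url, PageInfo info) {
        String haystack = url.toLowerCase(Locale.ROOT);
        if (info != null && info.text != null) {
            haystack += " " + info.text.toLowerCase(Locale.ROOT);
        }
        for (String word : hotWords) {
            if (haystack.contains(word)) {
                return true;
            }
        }
        return false;
    }
}
```

The point of the split is that a concrete filter only implements acceptUrl and gets the looked-up data handed to it, which is what the plain one-argument filter interface cannot provide on its own.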
> Naive Bayes classifier based url filter
> ---------------------------------------
>
> Key: NUTCH-2038
> URL: https://issues.apache.org/jira/browse/NUTCH-2038
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, injector, parser
> Reporter: Asitang Mishra
> Assignee: Chris A. Mattmann
> Labels: memex, nutch
> Fix For: 1.11
>
>
> A URL filter that, after the parsing stage, filters out the URLs found on
> pages that the classifier marks as irrelevant, keeping only those URLs that
> contain some "hot words" supplied in a list.