[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590099#comment-14590099
]
Asitang Mishra commented on NUTCH-2038:
---------------------------------------
Personal to-dos for the code (changes to be done):
1. Put a check to see if the filter is activated in the properties file.
2. Refine the way the "public boolean containsWord()" function works.
3. Delete/clean up the extra files produced during classification. Or, instead
of deleting them, check whether the model file already exists and, if so, skip
training the model again (this will make the crawl script more efficient, as
the model will be created only during the first parsing job).
4. Add comments, Javadocs, proper Nutch formatting, proper function and class
names, and property descriptions.
5. Add an example training file.
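A rough sketch of what items 2 and 3 could look like, not the actual patch code. The class name ModelCacheSketch, the Trainer interface, the ensureModel() helper, the model file name, and the containsWord(String, List) signature are all hypothetical, chosen only for illustration; the real filter's training step and word matching may differ.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ModelCacheSketch {

  /** Hypothetical stand-in for the real training step; returns the serialized model. */
  interface Trainer {
    byte[] train() throws IOException;
  }

  /**
   * Trains and writes the model only when no model file exists yet.
   * Returns true if training ran, false if the existing file was reused.
   */
  static boolean ensureModel(Path modelFile, Trainer trainer) throws IOException {
    if (Files.exists(modelFile)) {
      return false; // model from an earlier parsing job is still there, so reuse it
    }
    Files.write(modelFile, trainer.train());
    return true;
  }

  /** Assumed semantics: case-insensitive substring match of a URL against the hot words. */
  static boolean containsWord(String url, List<String> hotWords) {
    String lower = url.toLowerCase(Locale.ROOT);
    for (String word : hotWords) {
      if (!word.isEmpty() && lower.contains(word.toLowerCase(Locale.ROOT))) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("nb-model");
    Path model = dir.resolve("naivebayes-model"); // hypothetical file name
    Trainer trainer = () -> "dummy-model".getBytes(StandardCharsets.UTF_8);

    System.out.println(ensureModel(model, trainer)); // true: first job trains
    System.out.println(ensureModel(model, trainer)); // false: later jobs reuse the file
    System.out.println(
        containsWord("http://example.com/Nutch-crawl", Arrays.asList("nutch"))); // true
  }
}
```

With a check like this, only the first parsing job in a crawl pays the training cost, and the extra files would not need to be deleted between rounds.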
> Naive Bayes classifier based url filter
> ---------------------------------------
>
> Key: NUTCH-2038
> URL: https://issues.apache.org/jira/browse/NUTCH-2038
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, injector, parser
> Reporter: Asitang Mishra
> Assignee: Chris A. Mattmann
> Labels: memex, nutch
> Fix For: 1.11
>
>
> A URL filter that, after the parsing stage, filters out the URLs from pages
> that the classifier (using a provided model) classifies as irrelevant,
> keeping only those URLs that contain one of the "hot words" provided in a
> list.