[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based url filter

Asitang Mishra (JIRA) Wed, 17 Jun 2015 10:08:20 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14590099#comment-14590099
 ]


Asitang Mishra commented on NUTCH-2038:
---------------------------------------

Personal to-dos for the code (changes to be done):
1. Put a check to see if the filter is activated in the properties file. 
2. Refine the way the "public boolean containsWord()" function works.
3. Delete/clean up the extra files produced during classification. Alternatively,
keep them and add a check: if the model file already exists, don't train the
model again. (This will make the crawl script more efficient, as the model will
be created only during the first parsing job.)
4. Add comments, Javadocs, proper Nutch formatting, proper function and class
names, and property descriptions.
5. Add an example training file.
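
A minimal sketch of the check described in item 3, assuming the model is
stored as a single file. The class name ModelCacheSketch, the methods
ensureModel() and trainModel(), and the file names are hypothetical and are
not taken from the patch; trainModel() only writes a placeholder where the
real Naive Bayes training would go.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ModelCacheSketch {

    /** Hypothetical stand-in for the real training step; writes a placeholder model file. */
    static void trainModel(Path trainingFile, Path modelFile) throws IOException {
        Files.write(modelFile, "placeholder-model".getBytes(StandardCharsets.UTF_8));
    }

    /**
     * Trains only when no model file exists yet.
     *
     * @return true if training ran, false if an existing model file was reused
     */
    static boolean ensureModel(Path trainingFile, Path modelFile) throws IOException {
        if (Files.exists(modelFile)) {
            return false; // model already on disk: skip training
        }
        trainModel(trainingFile, modelFile);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("nb-model-sketch");
        Path training = dir.resolve("training.txt");
        Path model = dir.resolve("model.bin");
        Files.write(training, "1\tsample relevant text".getBytes(StandardCharsets.UTF_8));

        System.out.println(ensureModel(training, model)); // first call trains
        System.out.println(ensureModel(training, model)); // second call reuses the file
    }
}
```

With a check like this, only the first parsing job in a crawl pays the
training cost; later jobs find the model file and skip straight to
classification.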


> Naive Bayes classifier based url filter
> ---------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> A URL filter that runs after the parsing stage. From pages that the
> classifier (using a provided model) classifies as irrelevant, it filters out
> the URLs, keeping only those that contain some "hot words" provided in a
> list.
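
The filtering rule in the description can be sketched roughly as follows.
This is an illustration under stated assumptions, not the plugin's code: the
class HotWordFilterSketch and the method filterOutlinks() are hypothetical,
the containsWord() signature is guessed from its name, matching is assumed to
be a case-insensitive substring test, and outlinks from relevant pages are
assumed to be kept unchanged.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class HotWordFilterSketch {

    /** True if the URL contains at least one non-empty hot word (case-insensitive). */
    static boolean containsWord(String url, List<String> hotWords) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String word : hotWords) {
            if (!word.isEmpty() && lower.contains(word.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    /**
     * Assumed rule: a relevant page keeps all of its outlinks; an irrelevant
     * page keeps only the outlinks that contain a hot word.
     */
    static List<String> filterOutlinks(boolean pageIsRelevant,
                                       List<String> outlinks,
                                       List<String> hotWords) {
        if (pageIsRelevant) {
            return new ArrayList<>(outlinks);
        }
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            if (containsWord(url, hotWords)) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> hotWords = Arrays.asList("nutch", "crawl");
        List<String> outlinks = Arrays.asList(
                "http://example.com/Nutch/docs",
                "http://example.com/about");

        // The classifier verdict is passed in as a boolean here; in the real
        // filter it would come from the trained model.
        System.out.println(filterOutlinks(false, outlinks, hotWords));
    }
}
```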



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
