[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602553#comment-14602553
 ] 

Sebastian Nagel commented on NUTCH-2038:
----------------------------------------

Great [~asitangm]! I've tried running it via parsechecker and also within a 
small crawl.
* there is still one {{e.printStackTrace();}} :)
* if the plugin is activated in plugin.includes but not configured:
{noformat}
2015-06-26 09:33:24,174 ERROR naivebayes.NaiveBayesParseFilter - ParseFilter: NaiveBayes: trainfile or wordlist not set in the parsefilte.naivebayes.trainfile or parsefilte.naivebayes.wordlist
2015-06-26 09:33:24,175 WARN  parse.ParseSegment - Error parsing: file:/home/wastl/work/websearch/crawler/nutch/src/plugin/parse-exorbyte/sample/subdocuments1-html5.html: java.lang.IllegalArgumentException: ParseFilter: NaiveBayes: trainfile or wordlist not set in the parsefilte.naivebayes.trainfile or parsefilte.naivebayes.wordlist
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:120)
{noformat}
A plugin advertised in the description of plugin.includes should ideally work 
out-of-the-box. You could add train/word file templates to conf/ containing a 
few trivial ham/spam examples. The build would then instantiate and install 
them into runtime/, where users could simply modify them.
* there should be a clear error message if a configured file fails to load 
(e.g., "Failed to load naivebayes-train.txt configured in 
parsefilter.naivebayes.trainfile: ...") instead of
{noformat}
Exception in thread "main" java.lang.NullPointerException
        at java.io.Reader.<init>(Reader.java:78)
        at java.io.BufferedReader.<init>(BufferedReader.java:94)
        at java.io.BufferedReader.<init>(BufferedReader.java:109)
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:129)
{noformat}
* finally, the JobRunner crashed with:
{noformat}
2015-06-26 09:48:50,762 INFO  naivebayes.NaiveBayesParseFilter - Training the Naive Bayes Model
2015-06-26 09:48:50,764 WARN  mapred.LocalJobRunner - job_local1978281032_0001
java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:94)
        at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:142)
{noformat}
 That's probably because the dependencies are not listed in the 
plugin.xml.
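
To illustrate the "clear error message" point above, here is a minimal sketch 
in plain Java. It is not the plugin's actual code: the class name 
ConfiguredFileLoader and the method open() are hypothetical, and it reads from 
the classpath instead of going through the Nutch/Hadoop Configuration object. 
The idea is only to check for a missing resource before wrapping it in a 
BufferedReader, so the user sees the file and property names rather than a 
NullPointerException.

```java
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Nutch: shows the null check only.
public class ConfiguredFileLoader {

  /**
   * Opens the classpath resource fileName, which was configured under the
   * given property. Fails with a message naming both if it cannot be found.
   */
  public static BufferedReader open(String property, String fileName) {
    if (fileName == null || fileName.trim().isEmpty()) {
      throw new IllegalArgumentException("No file configured in " + property);
    }
    InputStream in =
        ConfiguredFileLoader.class.getClassLoader().getResourceAsStream(fileName);
    if (in == null) {
      // Without this check, new BufferedReader(null) would throw a bare NPE.
      throw new IllegalArgumentException("Failed to load " + fileName
          + " configured in " + property + ": resource not found on classpath");
    }
    return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
  }

  public static void main(String[] args) {
    try {
      open("parsefilter.naivebayes.trainfile", "naivebayes-train.txt");
    } catch (IllegalArgumentException e) {
      // Report through the message instead of e.printStackTrace().
      System.err.println(e.getMessage());
    }
  }
}
```

In the plugin itself the same check would presumably sit in setConf(), right 
where the reader for the train file is created.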

> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>
>                 Key: NUTCH-2038
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2038
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher, injector, parser
>            Reporter: Asitang Mishra
>            Assignee: Chris A. Mattmann
>              Labels: memex, nutch
>             Fix For: 1.11
>
>
> An HTML parse filter that filters outlinks in two stages. 
> Classify the parse text and decide if the parent page is relevant. If 
> relevant, then don't filter the outlinks. If irrelevant, then go through each 
> outlink and see if the URL contains any of the important words from a list. 
> If it does, then let it pass.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
