[
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602553#comment-14602553
]
Sebastian Nagel commented on NUTCH-2038:
----------------------------------------
Great [~asitangm]! I've tried running it via parsechecker and also within a
small crawl.
* there is still one {{e.printStackTrace();}} :)
* if the plugin is activated in plugin.includes but not configured:
{noformat}
2015-06-26 09:33:24,174 ERROR naivebayes.NaiveBayesParseFilter - ParseFilter: NaiveBayes: trainfile or wordlist not set in the parsefilte.naivebayes.trainfile or parsefilte.naivebayes.wordlist
2015-06-26 09:33:24,175 WARN parse.ParseSegment - Error parsing: file:/home/wastl/work/websearch/crawler/nutch/src/plugin/parse-exorbyte/sample/subdocuments1-html5.html: java.lang.IllegalArgumentException: ParseFilter: NaiveBayes: trainfile or wordlist not set in the parsefilte.naivebayes.trainfile or parsefilte.naivebayes.wordlist
at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:120)
{noformat}
A plugin advertised in the description of plugin.includes should ideally work
out of the box. You could add train/word file templates to conf/ containing a
few trivial ham/spam examples. These are then instantiated and installed into
runtime/, where users can simply modify them.
* there should be a clear error message if a configured file fails to load
(e.g., "Failed to load naivebayes-train.txt configured in
parsefilter.naivebayes.trainfile: ...") instead of
{noformat}
Exception in thread "main" java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.BufferedReader.<init>(BufferedReader.java:94)
at java.io.BufferedReader.<init>(BufferedReader.java:109)
at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:129)
{noformat}
* finally, the JobRunner crashed with:
{noformat}
2015-06-26 09:48:50,762 INFO naivebayes.NaiveBayesParseFilter - Training the Naive Bayes Model
2015-06-26 09:48:50,764 WARN mapred.LocalJobRunner - job_local1978281032_0001
java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer
at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:94)
at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:142)
{noformat}
That's probably because the plugin's dependencies are not listed in its
plugin.xml.
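To illustrate the two error-handling points above (property not set, configured
file fails to load), here is a minimal sketch. The class ConfiguredFileLoader
and its open() method are hypothetical names, not part of the patch, and a
plain classpath lookup stands in for whatever mechanism the plugin actually
uses to resolve configured files.

```java
import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not from the patch): opens a configured classpath resource. */
public class ConfiguredFileLoader {

  /**
   * @param property name of the configuration property, used only in messages
   * @param fileName value of that property; may be null if the property is unset
   */
  public static BufferedReader open(String property, String fileName)
      throws IOException {
    // Case 1: the property was never configured.
    if (fileName == null || fileName.trim().isEmpty()) {
      throw new IllegalArgumentException("Property " + property + " is not set");
    }
    // getResourceAsStream returns null (it does not throw) when nothing is found,
    // so check explicitly instead of passing null on to a Reader.
    InputStream in =
        ConfiguredFileLoader.class.getClassLoader().getResourceAsStream(fileName);
    // Case 2: the property is set, but the file cannot be found.
    if (in == null) {
      throw new FileNotFoundException("Failed to load " + fileName
          + " configured in " + property + ": not found on classpath");
    }
    return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
  }
}
```

The explicit null check is the point here: the message names both the file and
the property, rather than letting a null stream surface later as a
NullPointerException.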
> Naive Bayes classifier based html Parse filter (for filtering outlinks)
> -----------------------------------------------------------------------
>
> Key: NUTCH-2038
> URL: https://issues.apache.org/jira/browse/NUTCH-2038
> Project: Nutch
> Issue Type: New Feature
> Components: fetcher, injector, parser
> Reporter: Asitang Mishra
> Assignee: Chris A. Mattmann
> Labels: memex, nutch
> Fix For: 1.11
>
>
> An HTML parse filter that filters outlinks in two stages. First, classify
> the parse text and decide whether the parent page is relevant. If it is
> relevant, don't filter the outlinks. If it is irrelevant, go through each
> outlink and check whether its URL contains any of the important words from
> a list. If it does, let it pass.
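
The two-stage decision in the description could be sketched roughly as
follows. This is not the plugin's code: the class and method names are
hypothetical, the classifier result is passed in as a plain boolean, and
case-insensitive substring matching is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Hypothetical sketch of the two-stage outlink filtering, not the plugin's code. */
public class OutlinkFilterSketch {

  /**
   * @param pageRelevant   stage 1 result: did the classifier judge the parent page relevant?
   * @param outlinkUrls    URLs of the page's outlinks
   * @param importantWords word list used in stage 2
   * @return the outlinks that survive filtering
   */
  public static List<String> filterOutlinks(boolean pageRelevant,
      List<String> outlinkUrls, List<String> importantWords) {
    // Stage 1: a relevant parent page keeps all of its outlinks.
    if (pageRelevant) {
      return new ArrayList<>(outlinkUrls);
    }
    // Stage 2: for an irrelevant page, keep only outlinks whose URL
    // contains at least one important word (case-insensitive: an assumption).
    List<String> kept = new ArrayList<>();
    for (String url : outlinkUrls) {
      String lower = url.toLowerCase(Locale.ROOT);
      for (String word : importantWords) {
        if (lower.contains(word.toLowerCase(Locale.ROOT))) {
          kept.add(url);
          break; // one matching word is enough
        }
      }
    }
    return kept;
  }
}
```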