[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610226#comment-14610226 ]

Markus Jelsma commented on NUTCH-2038:

Ah no, this craziness, I know what you mean! Opening a new issue and keeping it in the core ivy is fine. Great work guys!

Naive Bayes classifier based html Parse filter (for filtering outlinks)

Key: NUTCH-2038
URL: https://issues.apache.org/jira/browse/NUTCH-2038
Project: Nutch
Issue Type: New Feature
Components: fetcher, injector, parser
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
Labels: memex, nutch
Fix For: 1.11

An HTML parse filter that filters out outlinks in two stages. First, classify the parse text and decide whether the parent page is relevant. If it is relevant, don't filter the outlinks. If it is irrelevant, go through each outlink and check whether the URL contains any of the important words from a list. If it does, let it pass.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
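The two-stage decision described in the issue summary can be sketched roughly as follows. This is only an illustration of the described logic, not the plugin's code: the class name `TwoStageOutlinkFilterSketch`, the `Predicate` standing in for the trained Naive Bayes classifier, and the in-memory word set standing in for the wordlist file are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical sketch of the two-stage outlink filter; not the actual plugin code. */
final class TwoStageOutlinkFilterSketch {

    // Stand-in for the trained classifier: true means "parent page is relevant".
    private final Predicate<String> pageIsRelevant;
    // Stand-in for the configured wordlist; entries are expected in lower case.
    private final Set<String> importantWords;

    TwoStageOutlinkFilterSketch(Predicate<String> pageIsRelevant, Set<String> importantWords) {
        this.pageIsRelevant = pageIsRelevant;
        this.importantWords = importantWords;
    }

    /** Returns the outlinks that survive the filter for one parsed page. */
    List<String> filter(String parseText, List<String> outlinkUrls) {
        // Stage 1: a relevant parent page keeps all of its outlinks.
        if (pageIsRelevant.test(parseText)) {
            return new ArrayList<>(outlinkUrls);
        }
        // Stage 2: for an irrelevant page, keep only URLs containing an important word.
        List<String> kept = new ArrayList<>();
        for (String url : outlinkUrls) {
            String lower = url.toLowerCase(Locale.ROOT);
            for (String word : importantWords) {
                if (lower.contains(word)) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }
}
```

Passing the classifier in as a predicate keeps the sketch independent of Mahout; how the real plugin wires in its model is not shown in this thread.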
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610201#comment-14610201 ]

Chris A. Mattmann commented on NUTCH-2038:

Hey [~markus.jel...@openindex.io], yeah, we tried to isolate the dependencies inside the plugin, but for whatever reason it doesn't seem to work when we do so. Seb, Asitang and I tried it. I think it has to do with the PluginClasspathLoading. Anyway, this is something we should try to figure out for the 1.11 release to see if we can get them out of there, but it's not a blocker now, since the functionality is pretty neat and since we're just talking about more jar files (that don't conflict with anything) in the lib directory.

[~asitang] can you open a new issue to figure out how to get the dependencies into the plugin's ivy and still make it work?
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610198#comment-14610198 ]

Chris A. Mattmann commented on NUTCH-2038:

Hey [~markus.jel...@openindex.io], yeah, I rolled back NUTCH-2052 (added it by accident). It's out of there now!
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610439#comment-14610439 ]

Asitang Mishra commented on NUTCH-2038:

Sure...
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610476#comment-14610476 ]

Asitang Mishra commented on NUTCH-2038:

NUTCH-2057
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609625#comment-14609625 ]

Hudson commented on NUTCH-2038:

SUCCESS: Integrated in Nutch-trunk #3186 (See [https://builds.apache.org/job/Nutch-trunk/3186/])

Fix for NUTCH-2038: add mattmann to template to make RSS links relevant. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688555)
* /nutch/trunk/conf/naivebayes-wordlist.txt.template

Add mattmann for unit test for NUTCH-2038 to pass. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688553)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/index-static/src/java/org/apache/nutch/indexer/staticfield/StaticFieldIndexer.java
* /nutch/trunk/src/plugin/index-static/src/test/org/apache/nutch/indexer/staticfield/TestStaticFieldIndexerTest.java
* /nutch/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine

Add mattmann for unit test for NUTCH-2038 to pass. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688552)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/naivebayes-wordlist.txt.template
* /nutch/trunk/src/plugin/index-static/src/java/org/apache/nutch/indexer/staticfield/StaticFieldIndexer.java
* /nutch/trunk/src/plugin/index-static/src/test/org/apache/nutch/indexer/staticfield/TestStaticFieldIndexerTest.java
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609639#comment-14609639 ]

Markus Jelsma commented on NUTCH-2038:

I have tried to search the comments here, but can anyone explain why Lucene and Mahout are in the core ivy.xml when this is just a plugin?
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609645#comment-14609645 ]

Markus Jelsma commented on NUTCH-2038:

Also, you have committed NUTCH-2052 :P
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609618#comment-14609618 ]

Chris A. Mattmann commented on NUTCH-2038:

It's b/c I didn't include the updates to the *.template file for the wordlist! Passing now in r1688555.

{noformat}
BUILD SUCCESSFUL
Total time: 7 minutes 52 seconds
[mattmann@imagecat nutch1.11]$
{noformat}

That's on a fresh machine with a new Nutch trunk install. I also have https://builds.apache.org/job/Nutch-trunk/3186/ scheduled, so it should go ahead and build fine there too.
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609295#comment-14609295 ]

Asitang Mishra commented on NUTCH-2038:

I tried adding the jars to the main ivy.xml. It works fine there.
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609542#comment-14609542 ]

Asitang Mishra commented on NUTCH-2038:

woot!!1
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609539#comment-14609539 ]

ASF GitHub Bot commented on NUTCH-2038:

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/42
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609555#comment-14609555 ]

Hudson commented on NUTCH-2038:

FAILURE: Integrated in Nutch-trunk #3184 (See [https://builds.apache.org/job/Nutch-trunk/3184/])

Updates to make tests pass related to NUTCH-2038: Naive Bayes classifier based html Parse filter (for filtering outlinks) this closes #42. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688549)
* /nutch/trunk/conf/naivebayes-train.txt.template
* /nutch/trunk/conf/naivebayes-wordlist.txt.template
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/default.properties
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/plugin/parsefilter-naivebayes/ivy.xml
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609575#comment-14609575 ]

Chris A. Mattmann commented on NUTCH-2038:

So this passed locally for me - wondering if it's b/c the file existed already?
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609538#comment-14609538 ]

Chris A. Mattmann commented on NUTCH-2038:

OK I found a few more issues:

1. Adding conf/naivebayes-*.txt.template didn't fully fix it, b/c the conf/nutch-default.xml needed updating to the wordlist property [done]
2. TestFeedParser in parse-tika failed b/c the default training and wordlist didn't identify the expected 2 outlinks as relevant. I've updated the conf/naivebayes-wordlist.txt to address this.
3. The files generated by Mahout should be put e.g., into crawl_dir/parsefilter-naivebayes, e.g.,
- temp
- vectors
- labelindex
- model
- outseq
- .labelindex.crc
[to be done in a future patch, [~asitang] please file an issue for this]
4. Moved dependencies out of the plugin and into main Nutch ivy.xml [probably not the best, but doesn't hurt too much for now and allows us to get around the plugin issue - done]
5. Updated default.properties to declare the parsefilter-naivebayes [done]

Going to commit the update now. All tests pass locally and tested. Thanks for everyone's review.
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609282#comment-14609282 ]

Sebastian Nagel commented on NUTCH-2038:

Hi [~asitang],

I was able to reproduce the exception. To train the classifier a MapReduce job is launched:
* It obviously does not have the classes of the plugin at hand. Each plugin uses its own class loader (see [[1|http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading]]). I don't know whether it's possible to make the plugin classes available to the training job.
* If the classifier is trained inside the parse step of a crawl, this means that a job/task launches another job. That sounds awkward. Again: I don't know whether this will work at all in local and in distributed mode.

Sorry that I haven't noticed this dependency on running a MapReduce job before. Unfortunately, Mahout does not provide a non-MapReduce version of Naive Bayes ([[2|https://mahout.apache.org/users/basics/algorithms.html]]). It needs some thought to find a solution. If in doubt, the training step could be run separately beforehand.
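The class-loader point in the comment above can be illustrated with a small, self-contained sketch. It is not Nutch, Hadoop, or Mahout code: `ClassLoaderVisibilitySketch` and its helper methods are hypothetical names, and an empty `URLClassLoader` with no parent merely plays the role of "a loader that does not have the plugin's jars on its path". The only thing it shows is that a class is found or not found depending on which loader is asked for it.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Hypothetical illustration of class visibility across loaders; not Nutch code. */
final class ClassLoaderVisibilitySketch {

    /** Returns true if the given loader (or one of its parents) can resolve the class. */
    static boolean canLoad(ClassLoader loader, String className) {
        try {
            // initialize=false: only resolve the class, do not run static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    /** A loader with no URLs and no application parent: it only sees JDK classes. */
    static ClassLoader isolatedLoader() {
        return new URLClassLoader(new URL[0], null);
    }

    /** The loader that actually defined this sketch class. */
    static ClassLoader owningLoader() {
        return ClassLoaderVisibilitySketch.class.getClassLoader();
    }
}
```

Asking `owningLoader()` for `ClassLoaderVisibilitySketch` succeeds, while asking `isolatedLoader()` for the same name fails. By analogy, a job that resolves its mapper class through a loader other than the plugin's would not find classes that live only in the plugin's jars.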
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605661#comment-14605661 ]

Asitang Mishra commented on NUTCH-2038:

Yup, it didn't fail for me either.. gonna list all the libraries in plugin.xml now
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605643#comment-14605643 ]

Chris A. Mattmann commented on NUTCH-2038:

Ugh. On #2, I guess I missed setting commons-cli from Mahout in the plugin's ivy.xml too. RE: #1... weird, that test didn't fail for me after I added Mahout's commons-cli to the ivy.xml. [~asitang] can you help me look at this?
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605729#comment-14605729 ]

Sebastian Nagel commented on NUTCH-2038:

Yep, this fixes problem #2.
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605716#comment-14605716 ]

Sebastian Nagel commented on NUTCH-2038:

(#1) The unit tests fail if build and tests are run from a clean source tree:
* 2 files from pull request #39 are missing in the SVN commit. Ideally, they should be added as
{noformat}
conf/naivebayes-train.txt.template
conf/naivebayes-wordlist.txt.template
{noformat}
As templates they only get instantiated/copied once, and are not overwritten if a user changes the *.txt in-place.
* Anyway, it would be nice to catch all IO errors related to an incomplete configuration and show a nice error message, e.g., Failed to load naivebayes-train.txt configured in parsefilter.naivebayes.trainfile: ...

(#2) The commons-cli is properly installed in {{build/plugins/parsefilter-naivebayes/}} or, respectively, in the corresponding runtime folder or job file: it should be enough to add all indirect dependencies to the plugin.xml
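The error-message suggestion under (#1) could look roughly like the sketch below. This is a guess at the shape of such a check, not the plugin's code: the class `ConfigFileLoaderSketch` and the method `loadLines` are hypothetical, and the file is read with plain `java.nio` rather than through Nutch's configuration classes. The property name and message wording are taken from the example in the comment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Hypothetical sketch: fail with a readable message when a configured file is missing. */
final class ConfigFileLoaderSketch {

    /**
     * Reads all lines of fileName, which was configured under propertyName.
     * Any IO problem is rethrown with a message naming both the file and the property.
     */
    static List<String> loadLines(String propertyName, String fileName) {
        try {
            return Files.readAllLines(Paths.get(fileName), StandardCharsets.UTF_8);
        } catch (IOException e) {
            // Keep the original exception as the cause so the stack trace is not lost.
            throw new IllegalStateException(
                "Failed to load " + fileName + " configured in " + propertyName + ": " + e, e);
        }
    }
}
```

Called as `loadLines("parsefilter.naivebayes.trainfile", "naivebayes-train.txt")`, a missing file then surfaces as a single message naming the file and the property instead of a bare IO stack trace.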
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605681#comment-14605681 ]

ASF GitHub Bot commented on NUTCH-2038:

GitHub user asitang opened a pull request:

    https://github.com/apache/nutch/pull/40

    NUTCH-2038 added all the jars in plugin.xml

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/40.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #40

commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-17T16:11:42Z

    patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-17T16:14:37Z

    patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-17T16:35:28Z

    patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-18T15:09:30Z

    final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-19T20:13:34Z

    Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-24T15:45:50Z

    commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-24T15:46:46Z

    commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-24T15:55:22Z

    commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-24T15:58:12Z

    commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-24T17:31:09Z

    commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-25T22:59:45Z

    patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-25T23:00:40Z

    patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-25T23:05:20Z

    patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-25T23:07:44Z

    patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-28T23:51:58Z

    Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-28T23:53:22Z

    Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-28T23:53:52Z

    Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-29T00:03:43Z

    Patch 5.0 for NUTCH-2038

commit 71be15df81222adc6b58b6308e1dac7db23b6386
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-29T04:21:38Z

    Patch 5.1 for NUTCH-2038

commit a9465c06d59e7ed2bd13d07c128bcea574fc9d6c
Author: Asitang Mishra asit...@gmail.com
Date: 2015-06-29T14:27:02Z

    Patch 5.2 for NUTCH-2038
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606348#comment-14606348 ]

ASF GitHub Bot commented on NUTCH-2038:

Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/41
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606592#comment-14606592 ]

Asitang Mishra commented on NUTCH-2038:
---

Hi [~wastl-nagel],

I am facing the following issue when running locally (please test the latest pull request for this). I also faced it with pull request #40. Please test and see whether you are facing it too. I have added all the dependencies and don't understand why it is still giving a class-not-found error:

java.lang.Exception: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
    at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
    at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:340)
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
    ... 9 more
2015-06-29 15:45:05,038 ERROR naivebayes.NaiveBayesParseFilter - Error occured while training:: java.lang.IllegalStateException: Job failed!
    at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:95)
    at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:257)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56)
    at org.apache.nutch.parsefilter.naivebayes.NaiveBayesClassifier.createModel(NaiveBayesClassifier.java:105)
    at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:90)
    at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:160)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
    at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441)
    at org.apache.nutch.parse.HtmlParseFilters.init(HtmlParseFilters.java:35)
    at org.apache.nutch.parse.html.HtmlParser.setConf(HtmlParser.java:343)
    at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
    at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136)
    at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:104)
    at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:46)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
    at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
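The trace above shows the failing lookup going through `sun.misc.Launcher$AppClassLoader`, so one thing worth checking is which class loaders can actually see the Mahout class. The following is a small diagnostic sketch, not part of Nutch or the patch; the class name `ClassVisibilitySketch` and the method `isVisible` are made up for illustration, and whether loader visibility is the real cause here is an assumption.

```java
/** Diagnostic sketch: reports whether a class name resolves through a given class loader. */
class ClassVisibilitySketch {

  static boolean isVisible(String className, ClassLoader loader) {
    try {
      // initialize=false: only locate the class, do not run its static initializers.
      Class.forName(className, false, loader);
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // Defaults to the class named in the stack trace; pass another name as args[0].
    String name = args.length > 0 ? args[0]
        : "org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper";
    ClassLoader system = ClassLoader.getSystemClassLoader();
    ClassLoader context = Thread.currentThread().getContextClassLoader();
    System.out.println("system loader:  " + isVisible(name, system));
    System.out.println("context loader: " + isVisible(name, context));
  }
}
```

Run with the same classpath as the failing job, a `false` for the system loader would suggest that the jar is reachable only from somewhere the application class loader does not look.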
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606359#comment-14606359 ] ASF GitHub Bot commented on NUTCH-2038: --- GitHub user asitang opened a pull request: https://github.com/apache/nutch/pull/42 NUTCH-2038 minor changes and suggestions by Sebastian. You can merge this pull request into a Git repository by running: $ git pull https://github.com/asitang/nutch NUTCH-2038 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/42.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #42 commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:11:42Z patch 1.0 for NUTCH-2038 commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:14:37Z patch 1.0 for NUTCH-2038 commit 711f44d8d4af51538ff1764145ac743445b6f43b Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:35:28Z patch 1.0 for NUTCH-2038 commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a Author: Asitang Mishra asit...@gmail.com Date: 2015-06-18T15:09:30Z final commir for pattch 1.0 commit cca768bc1c790a976594136433485fe899465cb8 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-19T20:13:34Z Patch 2.0 for NUTCH-2038 commit 0e80bf471b7d40965cf3bdad908252f5ce577d85 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:45:50Z commit for 3.0 patch of NUTCH-2038 commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:46:46Z commit for 3.0 patch of NUTCH-2038 commit 3a7bf466c76e8cffef96063101a39a77c328d657 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:55:22Z commit for 3.1 patch of NUTCH-2038 commit ae89456e9f4078111653273fe0ac52c26c568c36 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:58:12Z 
commit for 3.2 patch of NUTCH-2038 commit ae639ec40263fafbd6c0273c619d425ee482f7f0 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T17:31:09Z commit for 3.3 patch of NUTCH-2038 commit 5ba14790c1367deeb54d4d61f87be3d602cecedf Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T22:59:45Z patch 4.0 for NUTCH-2038 commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T23:00:40Z patch 4.0 for NUTCH-2038 commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T23:05:20Z patch 4.1 for NUTCH-2038 commit 830f05bfe77abf79b2877c2a9c388fa24b3df526 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T23:07:44Z patch 4.1 for NUTCH-2038 commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-28T23:51:58Z Patch 5.0 for NUTCH-2038 commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf Author: Asitang Mishra asit...@gmail.com Date: 2015-06-28T23:53:22Z Patch 5.0 for NUTCH-2038 commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a Author: Asitang Mishra asit...@gmail.com Date: 2015-06-28T23:53:52Z Patch 5.0 for NUTCH-2038 commit aba64fc941ed7616153d19410dbe9b9a0f8ef387 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T00:03:43Z Patch 5.0 for NUTCH-2038 commit 71be15df81222adc6b58b6308e1dac7db23b6386 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T04:21:38Z Patch 5.1 for NUTCH-2038 commit a9465c06d59e7ed2bd13d07c128bcea574fc9d6c Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T14:27:02Z Patch 5.2 for NUTCH-2038 commit 8f45e634c942df66ea9c1ee775bb216d35fabb87 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T19:23:39Z patch 6.0 for NUTCH-2038 commit 97278c5a09f5d4391473185d2268c7b26f151120 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T19:24:34Z patch 6.0 for NUTCH-2038 commit 9b876bc8cbad902b094d696e3df751d9f163e4b3 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T19:25:06Z patch 
6.0 for NUTCH-2038 commit 866486e6be337e8a1e0e5209642649a1834278d3 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T19:35:24Z patch 6.1 for NUTCH-2038 commit dd159175822a476cc5889da71a19272cf733e011 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T20:39:55Z patch 6.2 for NUTCH-2038 Naive Bayes classifier based html Parse filter (for filtering outlinks) --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606228#comment-14606228 ]

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asitang closed the pull request at:

    https://github.com/apache/nutch/pull/40

Naive Bayes classifier based html Parse filter (for filtering outlinks)
---
Key: NUTCH-2038
URL: https://issues.apache.org/jira/browse/NUTCH-2038
Project: Nutch
Issue Type: New Feature
Components: fetcher, injector, parser
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
Labels: memex, nutch
Fix For: 1.11

A html parse filter that will filter out the outlinks in two stages. Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606234#comment-14606234 ] ASF GitHub Bot commented on NUTCH-2038: --- GitHub user asitang opened a pull request: https://github.com/apache/nutch/pull/41 NUTCH-2038 --added specific IOException messages --added files: conf/naivebayes-train.txt.template conf/naivebayes-wordlist.txt.template You can merge this pull request into a Git repository by running: $ git pull https://github.com/asitang/nutch NUTCH-2038 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/41.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #41 commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:11:42Z patch 1.0 for NUTCH-2038 commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:14:37Z patch 1.0 for NUTCH-2038 commit 711f44d8d4af51538ff1764145ac743445b6f43b Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:35:28Z patch 1.0 for NUTCH-2038 commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a Author: Asitang Mishra asit...@gmail.com Date: 2015-06-18T15:09:30Z final commir for pattch 1.0 commit cca768bc1c790a976594136433485fe899465cb8 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-19T20:13:34Z Patch 2.0 for NUTCH-2038 commit 0e80bf471b7d40965cf3bdad908252f5ce577d85 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:45:50Z commit for 3.0 patch of NUTCH-2038 commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:46:46Z commit for 3.0 patch of NUTCH-2038 commit 3a7bf466c76e8cffef96063101a39a77c328d657 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:55:22Z commit for 3.1 patch of NUTCH-2038 commit 
ae89456e9f4078111653273fe0ac52c26c568c36 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:58:12Z commit for 3.2 patch of NUTCH-2038 commit ae639ec40263fafbd6c0273c619d425ee482f7f0 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T17:31:09Z commit for 3.3 patch of NUTCH-2038 commit 5ba14790c1367deeb54d4d61f87be3d602cecedf Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T22:59:45Z patch 4.0 for NUTCH-2038 commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T23:00:40Z patch 4.0 for NUTCH-2038 commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T23:05:20Z patch 4.1 for NUTCH-2038 commit 830f05bfe77abf79b2877c2a9c388fa24b3df526 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T23:07:44Z patch 4.1 for NUTCH-2038 commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-28T23:51:58Z Patch 5.0 for NUTCH-2038 commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf Author: Asitang Mishra asit...@gmail.com Date: 2015-06-28T23:53:22Z Patch 5.0 for NUTCH-2038 commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a Author: Asitang Mishra asit...@gmail.com Date: 2015-06-28T23:53:52Z Patch 5.0 for NUTCH-2038 commit aba64fc941ed7616153d19410dbe9b9a0f8ef387 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T00:03:43Z Patch 5.0 for NUTCH-2038 commit 71be15df81222adc6b58b6308e1dac7db23b6386 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T04:21:38Z Patch 5.1 for NUTCH-2038 commit a9465c06d59e7ed2bd13d07c128bcea574fc9d6c Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T14:27:02Z Patch 5.2 for NUTCH-2038 commit 8f45e634c942df66ea9c1ee775bb216d35fabb87 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T19:23:39Z patch 6.0 for NUTCH-2038 commit 97278c5a09f5d4391473185d2268c7b26f151120 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T19:24:34Z patch 6.0 for NUTCH-2038 commit 
9b876bc8cbad902b094d696e3df751d9f163e4b3 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T19:25:06Z patch 6.0 for NUTCH-2038 commit 866486e6be337e8a1e0e5209642649a1834278d3 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T19:35:24Z patch 6.1 for NUTCH-2038 Naive Bayes classifier based html Parse filter (for filtering outlinks) --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604998#comment-14604998 ]

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asitang closed the pull request at:

    https://github.com/apache/nutch/pull/37

Naive Bayes classifier based html Parse filter (for filtering outlinks)
---
Key: NUTCH-2038
URL: https://issues.apache.org/jira/browse/NUTCH-2038
Project: Nutch
Issue Type: New Feature
Components: fetcher, injector, parser
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
Labels: memex, nutch
Fix For: 1.11

A html parse filter that will filter out the outlinks in two stages. Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14604999#comment-14604999 ] ASF GitHub Bot commented on NUTCH-2038: --- GitHub user asitang opened a pull request: https://github.com/apache/nutch/pull/38 NUTCH-2038 Made new changes suggested and corrections by Sebastian. You can merge this pull request into a Git repository by running: $ git pull https://github.com/asitang/nutch NUTCH-2038 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/nutch/pull/38.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #38 commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:11:42Z patch 1.0 for NUTCH-2038 commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:14:37Z patch 1.0 for NUTCH-2038 commit 711f44d8d4af51538ff1764145ac743445b6f43b Author: Asitang Mishra asit...@gmail.com Date: 2015-06-17T16:35:28Z patch 1.0 for NUTCH-2038 commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a Author: Asitang Mishra asit...@gmail.com Date: 2015-06-18T15:09:30Z final commir for pattch 1.0 commit cca768bc1c790a976594136433485fe899465cb8 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-19T20:13:34Z Patch 2.0 for NUTCH-2038 commit 0e80bf471b7d40965cf3bdad908252f5ce577d85 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:45:50Z commit for 3.0 patch of NUTCH-2038 commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:46:46Z commit for 3.0 patch of NUTCH-2038 commit 3a7bf466c76e8cffef96063101a39a77c328d657 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T15:55:22Z commit for 3.1 patch of NUTCH-2038 commit ae89456e9f4078111653273fe0ac52c26c568c36 Author: Asitang Mishra asit...@gmail.com Date: 
2015-06-24T15:58:12Z commit for 3.2 patch of NUTCH-2038 commit ae639ec40263fafbd6c0273c619d425ee482f7f0 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-24T17:31:09Z commit for 3.3 patch of NUTCH-2038 commit 5ba14790c1367deeb54d4d61f87be3d602cecedf Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T22:59:45Z patch 4.0 for NUTCH-2038 commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T23:00:40Z patch 4.0 for NUTCH-2038 commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T23:05:20Z patch 4.1 for NUTCH-2038 commit 830f05bfe77abf79b2877c2a9c388fa24b3df526 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-25T23:07:44Z patch 4.1 for NUTCH-2038 commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-28T23:51:58Z Patch 5.0 for NUTCH-2038 commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf Author: Asitang Mishra asit...@gmail.com Date: 2015-06-28T23:53:22Z Patch 5.0 for NUTCH-2038 commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a Author: Asitang Mishra asit...@gmail.com Date: 2015-06-28T23:53:52Z Patch 5.0 for NUTCH-2038 commit aba64fc941ed7616153d19410dbe9b9a0f8ef387 Author: Asitang Mishra asit...@gmail.com Date: 2015-06-29T00:03:43Z Patch 5.0 for NUTCH-2038 Naive Bayes classifier based html Parse filter (for filtering outlinks) --- Key: NUTCH-2038 URL: https://issues.apache.org/jira/browse/NUTCH-2038 Project: Nutch Issue Type: New Feature Components: fetcher, injector, parser Reporter: Asitang Mishra Assignee: Chris A. Mattmann Labels: memex, nutch Fix For: 1.11 A html parse filter that will filter out the outlinks in two stages. Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605088#comment-14605088 ]

ASF GitHub Bot commented on NUTCH-2038:
---

Github user chrismattmann commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/38#discussion_r33432911

--- Diff: src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java ---
@@ -0,0 +1,204 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of parsefilter.naivebayes.trainfile) and if found irrelevent it
+ * gives the link a second chance if it contains any of the words from the list
+ * given in parsefilter.naivebayes.wordlist. CAUTION: Set the parser.timeout to
+ * -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "parsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "parsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
+      LOG.error("Error occured while classifying:: " + text + " ::"
+          + StringUtils.stringifyException(e));
+    }
+
+    return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+    return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+    // if classified as relevent 1 then return true
+    if (NaiveBayesClassifier.classify(text).equals("1"))
+      return true;
+    return false;
+  }
+
+  public void train() throws Exception {
+    // check if the model file exists, if it does then don't train
+    if (!FileSystem.get(conf).exists(new Path("model"))) {
+      LOG.info("Training the Naive Bayes Model");
+      NaiveBayesClassifier.createModel(inputFilePath);
+    } else {
+      LOG.info("Model file already exists. Skipping training.");
+    }
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+    for (String word : wordlist) {
+      if (url.contains(word)) {
+        return true;
+      }
+    }
+
+    return false;
+  }
+
+  public void setConf(Configuration conf) {
+    this.conf = conf;
+    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605087#comment-14605087 ]

ASF GitHub Bot commented on NUTCH-2038:
---

Github user chrismattmann commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/38#discussion_r33432889

--- Diff: src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java ---
@@ -0,0 +1,204 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of parsefilter.naivebayes.trainfile) and if found irrelevent it
+ * gives the link a second chance if it contains any of the words from the list
+ * given in parsefilter.naivebayes.wordlist. CAUTION: Set the parser.timeout to
+ * -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(NaiveBayesParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "parsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "parsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public boolean filterParse(String text) {
+
+    try {
+      return classify(text);
+    } catch (IOException e) {
+      // TODO Auto-generated catch block
--- End diff --

I asked for this to be removed.

Naive Bayes classifier based html Parse filter (for filtering outlinks)
---
Key: NUTCH-2038
URL: https://issues.apache.org/jira/browse/NUTCH-2038
Project: Nutch
Issue Type: New Feature
Components: fetcher, injector, parser
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
Labels: memex, nutch
Fix For: 1.11

A html parse filter that will filter out the outlinks in two stages. Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605169#comment-14605169 ]

Hudson commented on NUTCH-2038:
---

FAILURE: Integrated in Nutch-trunk #3181 (See [https://builds.apache.org/job/Nutch-trunk/3181/])

fix for NUTCH-2038: Naive Bayes classifier based html Parse filter (for filtering outlinks) contributed by Asitang Mishra asit...@gmail.com this closes #39 (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688085)
* /nutch/trunk/CHANGES.txt

fix for NUTCH-2038: Naive Bayes classifier based html Parse filter (for filtering outlinks) contributed by Asitang Mishra asit...@gmail.com this closes #39 (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688084)
* /nutch/trunk/.gitignore
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/parsefilter-naivebayes
* /nutch/trunk/src/plugin/parsefilter-naivebayes/build.xml
* /nutch/trunk/src/plugin/parsefilter-naivebayes/ivy.xml
* /nutch/trunk/src/plugin/parsefilter-naivebayes/plugin.xml
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesClassifier.java
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/package-info.java

Naive Bayes classifier based html Parse filter (for filtering outlinks)
---
Key: NUTCH-2038
URL: https://issues.apache.org/jira/browse/NUTCH-2038
Project: Nutch
Issue Type: New Feature
Components: fetcher, injector, parser
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
Labels: memex, nutch
Fix For: 1.11

A html parse filter that will filter out the outlinks in two stages. Classify the parse text and decide if the parent page is relevant. If relevant then don't filter the outlinks. If irrelevant then go thru each outlink and see if the url contains any of the important words from a list. If it does then let it pass.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605099#comment-14605099 ]

ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

GitHub user asitang opened a pull request:

    https://github.com/apache/nutch/pull/39

    NUTCH-2038 Removed the TODO comments

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/39.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #39

commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-17T16:11:42Z

    patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-17T16:14:37Z

    patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-17T16:35:28Z

    patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-18T15:09:30Z

    final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-19T20:13:34Z

    Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-24T15:45:50Z

    commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-24T15:46:46Z

    commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-24T15:55:22Z

    commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-24T15:58:12Z

    commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-24T17:31:09Z

    commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-25T22:59:45Z

    patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-25T23:00:40Z

    patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-25T23:05:20Z

    patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-25T23:07:44Z

    patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-28T23:51:58Z

    Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-28T23:53:22Z

    Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-28T23:53:52Z

    Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-29T00:03:43Z

    Patch 5.0 for NUTCH-2038

commit 71be15df81222adc6b58b6308e1dac7db23b6386
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-29T04:21:38Z

    Patch 5.1 for NUTCH-2038
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605093#comment-14605093 ]

ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

Github user asitang commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/38#discussion_r33433090

    --- Diff: src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java ---
    @@ -0,0 +1,204 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nutch.parsefilter.naivebayes;
    +
    +import org.slf4j.Logger;
    +import org.slf4j.LoggerFactory;
    +import org.w3c.dom.DocumentFragment;
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.fs.FileSystem;
    +import org.apache.hadoop.fs.Path;
    +import org.apache.hadoop.util.StringUtils;
    +import org.apache.nutch.parse.HTMLMetaTags;
    +import org.apache.nutch.parse.HtmlParseFilter;
    +import org.apache.nutch.parse.Outlink;
    +import org.apache.nutch.parse.Parse;
    +import org.apache.nutch.parse.ParseData;
    +import org.apache.nutch.parse.ParseResult;
    +import org.apache.nutch.parse.ParseStatus;
    +import org.apache.nutch.parse.ParseText;
    +import org.apache.nutch.protocol.Content;
    +
    +import java.io.Reader;
    +import java.io.BufferedReader;
    +import java.io.IOException;
    +import java.util.ArrayList;
    +
    +/**
    + * Html Parse filter that classifies the outlinks from the parseresult as
    + * relevant or irrelevant based on the parseText's relevancy (using a training
    + * file where you can give positive and negative example texts see the
    + * description of parsefilter.naivebayes.trainfile) and if found irrelevent it
    + * gives the link a second chance if it contains any of the words from the list
    + * given in parsefilter.naivebayes.wordlist. CAUTION: Set the parser.timeout to
    + * -1 or a bigger value than 30, when using this classifier.
    + */
    +public class NaiveBayesParseFilter implements HtmlParseFilter {
    +
    +  private static final Logger LOG = LoggerFactory
    +      .getLogger(NaiveBayesParseFilter.class);
    +
    +  public static final String TRAINFILE_MODELFILTER = "parsefilter.naivebayes.trainfile";
    +  public static final String DICTFILE_MODELFILTER = "parsefilter.naivebayes.wordlist";
    +
    +  private Configuration conf;
    +  private String inputFilePath;
    +  private String dictionaryFile;
    +  private ArrayList<String> wordlist = new ArrayList<String>();
    +
    +  public boolean filterParse(String text) {
    +
    +    try {
    +      return classify(text);
    +    } catch (IOException e) {
    +      // TODO Auto-generated catch block
    +      LOG.error("Error occured while classifying:: " + text + " ::"
    +          + StringUtils.stringifyException(e));
    +    }
    +
    +    return false;
    +  }
    +
    +  public boolean filterUrl(String url) {
    +
    +    return containsWord(url, wordlist);
    +
    +  }
    +
    +  public boolean classify(String text) throws IOException {
    +
    +    // if classified as relevent "1" then return true
    +    if (NaiveBayesClassifier.classify(text).equals("1"))
    +      return true;
    +    return false;
    +  }
    +
    +  public void train() throws Exception {
    +    // check if the model file exists, if it does then don't train
    +    if (!FileSystem.get(conf).exists(new Path("model"))) {
    +      LOG.info("Training the Naive Bayes Model");
    +      NaiveBayesClassifier.createModel(inputFilePath);
    +    } else {
    +      LOG.info("Model file already exists. Skipping training.");
    +    }
    +  }
    +
    +  public boolean containsWord(String url, ArrayList<String> wordlist) {
    +    for (String word : wordlist) {
    +      if (url.contains(word)) {
    +        return true;
    +      }
    +    }
    +
    +    return false;
    +  }
    +
    +  public void setConf(Configuration conf) {
    +    this.conf = conf;
    +    inputFilePath = conf.get(TRAINFILE_MODELFILTER);
    +
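
The class javadoc in the diff above describes a two-stage decision: keep every outlink when the parent page is classified as relevant, otherwise keep only the outlinks whose URL contains a word from the wordlist. A minimal, self-contained sketch of that control flow follows. It is not the plugin's code: the class TwoStageOutlinkFilter is hypothetical, a Predicate<String> stands in for the Naive Bayes classifier, and outlinks are plain URL strings rather than Nutch Outlink objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of the two-stage outlink filter; not the Nutch plugin itself.
final class TwoStageOutlinkFilter {

  // Stand-in for the trained classifier: true means "the parent page is relevant".
  private final Predicate<String> pageIsRelevant;
  // Words that give an outlink a second chance when the parent page is irrelevant.
  private final List<String> wordlist;

  TwoStageOutlinkFilter(Predicate<String> pageIsRelevant, List<String> wordlist) {
    this.pageIsRelevant = pageIsRelevant;
    this.wordlist = new ArrayList<>(wordlist);
  }

  List<String> filter(String parseText, List<String> outlinkUrls) {
    // Stage 1: a relevant parent page keeps all of its outlinks.
    if (pageIsRelevant.test(parseText)) {
      return new ArrayList<>(outlinkUrls);
    }
    // Stage 2: on an irrelevant page, keep only URLs that contain a listed word.
    List<String> kept = new ArrayList<>();
    for (String url : outlinkUrls) {
      if (containsWord(url)) {
        kept.add(url);
      }
    }
    return kept;
  }

  private boolean containsWord(String url) {
    for (String word : wordlist) {
      if (url.contains(word)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    TwoStageOutlinkFilter filter = new TwoStageOutlinkFilter(
        text -> text.contains("crawler"), Arrays.asList("nutch"));
    List<String> links = Arrays.asList(
        "http://example.org/nutch", "http://example.org/ads");
    System.out.println(filter.filter("a page about a web crawler", links));
    System.out.println(filter.filter("an unrelated page", links));
  }
}
```

Passing the classifier in as a predicate keeps the sketch independent of how the model is trained or stored, which is the part of the patch still under review in this thread.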
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605098#comment-14605098 ]

ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

Github user asitang closed the pull request at:

    https://github.com/apache/nutch/pull/38
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605132#comment-14605132 ]

ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/nutch/pull/39
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605120#comment-14605120 ]

Chris A. Mattmann commented on NUTCH-2038:
------------------------------------------

Tests fail in TestParserFactory:

{noformat}
org/apache/commons/cli2/Option
java.lang.NoClassDefFoundError: org/apache/commons/cli2/Option
	at org.apache.nutch.parsefilter.naivebayes.NaiveBayesClassifier.createModel(NaiveBayesClassifier.java:105)
	at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:93)
	at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:148)
	at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
	at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441)
	at org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:34)
	at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:244)
	at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
	at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136)
	at org.apache.nutch.parse.TestParserFactory.testGetParsers(TestParserFactory.java:63)
{noformat}

I'll fix it.
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602553#comment-14602553 ]

Sebastian Nagel commented on NUTCH-2038:
----------------------------------------

Great [~asitangm]! I tried to run it via parsechecker and also within a small crawl.

* there is still one {{e.printStackTrace();}} :)
* if the plugin is activated in plugin.includes but not configured:
{noformat}
2015-06-26 09:33:24,174 ERROR naivebayes.NaiveBayesParseFilter - ParseFilter: NaiveBayes: trainfile or wordlist not set in the parsefilte.naivebayes.trainfile or parsefilte.naivebayes.wordlist
2015-06-26 09:33:24,175 WARN parse.ParseSegment - Error parsing: file:/home/wastl/work/websearch/crawler/nutch/src/plugin/parse-exorbyte/sample/subdocuments1-html5.html: java.lang.IllegalArgumentException: ParseFilter: NaiveBayes: trainfile or wordlist not set in the parsefilte.naivebayes.trainfile or parsefilte.naivebayes.wordlist
	at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:120)
{noformat}
A plugin promoted in the description of plugin.includes should ideally work out of the box. You could add train/word file templates to conf/ containing a few trivial ham/spam examples. They are then instantiated and installed into runtime/, and users could just modify them.
* there should be a clear error message if a configured file fails to load (e.g., "Failed to load naivebayes-train.txt configured in parsefilter.naivebayes.trainfile: ...") instead of
{noformat}
Exception in thread "main" java.lang.NullPointerException
	at java.io.Reader.<init>(Reader.java:78)
	at java.io.BufferedReader.<init>(BufferedReader.java:94)
	at java.io.BufferedReader.<init>(BufferedReader.java:109)
	at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:129)
{noformat}
* finally, the JobRunner crashed with:
{noformat}
2015-06-26 09:48:50,762 INFO naivebayes.NaiveBayesParseFilter - Training the Naive Bayes Model
2015-06-26 09:48:50,764 WARN mapred.LocalJobRunner - job_local1978281032_0001
java.lang.Exception: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer
	at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:94)
	at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:142)
{noformat}
That's probably because the dependencies are not listed in the plugin.xml.
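
The request above for a clear error message when a configured file fails to load can be sketched as follows. This is only an illustration under stated assumptions: ConfiguredFileLoader and loadLines are hypothetical names, and the file is looked up with the standard ClassLoader.getResourceAsStream (which returns null for a missing resource) rather than through the Hadoop Configuration the plugin actually uses. The point is to check for the missing resource before wrapping it in a reader, so the user sees the file name and the property instead of a NullPointerException.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; names and lookup mechanism are assumptions, not the plugin's code.
final class ConfiguredFileLoader {

  /** Reads all lines of a classpath resource, failing with a descriptive message. */
  static List<String> loadLines(String propertyName, String fileName) {
    if (fileName == null || fileName.trim().isEmpty()) {
      throw new IllegalArgumentException("Property " + propertyName + " is not set");
    }
    InputStream in =
        ConfiguredFileLoader.class.getClassLoader().getResourceAsStream(fileName);
    if (in == null) {
      // Without this check the null stream would surface later as a bare NPE.
      throw new IllegalArgumentException("Failed to load " + fileName
          + " configured in " + propertyName + ": file not found on the classpath");
    }
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    } catch (IOException e) {
      throw new IllegalArgumentException("Failed to load " + fileName
          + " configured in " + propertyName + ": " + e.getMessage(), e);
    }
    return lines;
  }

  public static void main(String[] args) {
    try {
      loadLines("parsefilter.naivebayes.trainfile", "no-such-train-file.txt");
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```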
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603977#comment-14603977 ]

Asitang Mishra commented on NUTCH-2038:
---------------------------------------

Oh Great will fix them all :)
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600788#comment-14600788 ]

Sebastian Nagel commented on NUTCH-2038:
----------------------------------------

Great, thanks! Ideally the model is loaded from setConf() which is executed once per job/task.
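
The suggestion to load the model from setConf() can be sketched as below. This is a simplified stand-in, not the plugin: CachedModelFilter, its nested Model interface, and the Supplier-based setConf signature are all hypothetical, chosen so the example compiles without Nutch or Hadoop on the classpath. What it shows is the shape of the fix: the expensive load happens once when the filter is configured, and every later classification reuses the in-memory model.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical sketch of "load once in setConf(), then reuse"; not the Nutch plugin.
final class CachedModelFilter {

  /** Minimal assumed model interface; the plugin's real model type is not shown in this thread. */
  interface Model {
    boolean isRelevant(String text);
  }

  // Held in memory for the lifetime of this filter instance (one job/task).
  private Model model;

  /** Plays the role of setConf(): the model is read or trained here, exactly once. */
  void setConf(Supplier<Model> modelLoader) {
    this.model = modelLoader.get();
  }

  /** Called once per parsed page; never touches the disk. */
  boolean filterParse(String text) {
    if (model == null) {
      throw new IllegalStateException("setConf() must be called before filterParse()");
    }
    return model.isRelevant(text);
  }

  public static void main(String[] args) {
    AtomicInteger loads = new AtomicInteger();
    CachedModelFilter filter = new CachedModelFilter();
    filter.setConf(() -> {
      loads.incrementAndGet(); // counts how often the (expensive) load runs
      return text -> text.contains("nutch");
    });
    filter.filterParse("apache nutch");
    filter.filterParse("something else");
    System.out.println("model loads: " + loads.get());
  }
}
```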
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601499#comment-14601499 ]

Chris A. Mattmann commented on NUTCH-2038:
------------------------------------------

OK so it looks like the latest patch is committable. Asitang is addressing Seb's comments. I will go ahead and get this committed later today after I see the final PR/patch. Great work.
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602110#comment-14602110 ]

ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

GitHub user asitang opened a pull request:

    https://github.com/apache/nutch/pull/37

    NUTCH-2038 Made changes suggested by Sebastial Nagel.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/37.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #37

commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-17T16:11:42Z

    patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-17T16:14:37Z

    patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-17T16:35:28Z

    patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-18T15:09:30Z

    final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-19T20:13:34Z

    Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-24T15:45:50Z

    commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-24T15:46:46Z

    commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-24T15:55:22Z

    commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-24T15:58:12Z

    commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-24T17:31:09Z

    commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-25T22:59:45Z

    patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-25T23:00:40Z

    patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-25T23:05:20Z

    patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra <asit...@gmail.com>
Date:   2015-06-25T23:07:44Z

    patch 4.1 for NUTCH-2038
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602108#comment-14602108 ]

ASF GitHub Bot commented on NUTCH-2038:
---------------------------------------

Github user asitang closed the pull request at:

    https://github.com/apache/nutch/pull/36
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600269#comment-14600269 ]

Sebastian Nagel commented on NUTCH-2038:
----------------------------------------

Hi [~asitang],

the latest pull request #36 looks good.
- maybe rename the plugin to parsefilter-naivebayes for simplicity and in advance of NUTCH-1482
- is this statement still true?
bq. CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier.
- afaics, the way the model is generated, stored and loaded needs a review:
-* it should be read/generated once and then cached in memory,
-* writing the model to disk is likely to become painful in distributed mode with concurrent tasks.
- cosmetics:
-* exceptions should be logged properly via LOG.error(StringUtils.stringifyException(e)) so that they do not get lost somewhere in stdout/stderr, as happens with e.printStackTrace()
-* code formatting, see [[1|http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_Three:_Using_the_JIRA_and_Developing]]
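
The logging point under "cosmetics" can be illustrated with a small, dependency-free sketch. The review refers to Hadoop's StringUtils.stringifyException together with an SLF4J logger; to keep this example runnable without those libraries, the hypothetical class ExceptionLogging below re-implements the same idea with java.util.logging and a local stringify helper. Both names are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.logging.Logger;

// Hypothetical stand-in for LOG.error(StringUtils.stringifyException(e)).
final class ExceptionLogging {

  private static final Logger LOG = Logger.getLogger(ExceptionLogging.class.getName());

  /** Renders the exception and its stack trace into a single string. */
  static String stringify(Throwable t) {
    StringWriter buffer = new StringWriter();
    try (PrintWriter out = new PrintWriter(buffer)) {
      t.printStackTrace(out);
    }
    return buffer.toString();
  }

  /** Logs a failure through the logger instead of calling e.printStackTrace(). */
  static boolean classifyOrFalse(String text) {
    try {
      if (text == null) {
        throw new IOException("no text to classify");
      }
      return !text.isEmpty();
    } catch (IOException e) {
      // The trace goes to the configured log, not to an unmanaged stderr stream.
      LOG.severe("Error occurred while classifying: " + stringify(e));
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(classifyOrFalse(null));
  }
}
```

Routing the trace through the logger means it ends up wherever the job's logging configuration sends it, which matters when tasks run on remote nodes.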
[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)
[ https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600419#comment-14600419 ]

Asitang Mishra commented on NUTCH-2038:
---------------------------------------

> maybe rename the plugin to parsefilter-naivebayes for simplicity and in advance of NUTCH-1482

Will do that.

> is this statement still true? CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this classifier.

The first ever call to the parse filter takes a bit more time because the training is done and the model is created, so the timeout should be a little higher. It does not take much time after this.

> afaics, the way the model is generated, stored and loaded needs a review: it should be read/generated once and then cached in memory, writing the model to disk is likely to become painful in distributed mode with concurrent tasks.

The model is created during the parsing of the first fetched page of the very first parse job. After that it checks whether the model file is already present. The model file is currently read each time the classify() function is called; I will change that and keep the model in memory for the whole of a single parse job.

> cosmetics: exceptions are properly logged via LOG.error(StringUtils.stringifyException(e)) and do not get lost somewhere in stdout/stderr as of e.printStackTrace() code formatting, see [1]

Will do that.