[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-07-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610226#comment-14610226
 ] 

Markus Jelsma commented on NUTCH-2038:
--

Ah no, this craziness, I know what you mean! Opening a new issue and keeping it 
in the core ivy.xml is fine.
Great work guys!

 Naive Bayes classifier based html Parse filter (for filtering outlinks)
 ---

 Key: NUTCH-2038
 URL: https://issues.apache.org/jira/browse/NUTCH-2038
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, injector, parser
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
  Labels: memex, nutch
 Fix For: 1.11


 An HTML parse filter that filters outlinks in two stages. First, classify the 
 parse text and decide whether the parent page is relevant. If it is relevant, 
 don't filter the outlinks. If it is irrelevant, go through each outlink and 
 check whether the URL contains any of the important words from a list. 
 If it does, let it pass.
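
 The two stages described above could be sketched roughly as follows. This is 
 only an illustration, not the plugin's actual code: the class name 
 OutlinkFilterSketch, the Predicate standing in for the trained Naive Bayes 
 model, the plain case-sensitive substring match, and the example URLs are all 
 assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of the two-stage outlink filter; not the plugin's code. */
class OutlinkFilterSketch {

    /**
     * Stage 1: if the parent page is classified as relevant, keep every outlink.
     * Stage 2: otherwise keep only outlinks whose URL contains a listed word.
     */
    static List<String> filterOutlinks(String parseText,
                                       List<String> outlinks,
                                       Predicate<String> isRelevant,
                                       List<String> wordlist) {
        if (isRelevant.test(parseText)) {
            return new ArrayList<>(outlinks);
        }
        List<String> kept = new ArrayList<>();
        for (String url : outlinks) {
            for (String word : wordlist) {
                // Assumption: a plain substring match; empty words are ignored.
                if (!word.isEmpty() && url.contains(word)) {
                    kept.add(url);
                    break;
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Stand-in for the trained classifier: a trivial keyword check.
        Predicate<String> classifier = text -> text.contains("crawler");
        List<String> links = Arrays.asList(
            "http://example.org/nutch/docs", "http://example.org/ads");
        List<String> words = Arrays.asList("nutch");

        // Relevant parent page: all outlinks are kept.
        System.out.println(filterOutlinks("about a web crawler", links, classifier, words));
        // Irrelevant parent page: only URLs containing a listed word are kept.
        System.out.println(filterOutlinks("unrelated page", links, classifier, words));
    }
}
```

 Passing the classifier in as a Predicate keeps the sketch independent of 
 Mahout; a real filter would call the trained model at that point.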



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-07-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610201#comment-14610201
 ] 

Chris A. Mattmann commented on NUTCH-2038:
--

Hey [~markus.jel...@openindex.io], yeah, we tried to isolate the dependencies 
in the plugin, but for whatever reason it doesn't seem to work when we do so. 
Seb, Asitang and I tried it; I think it has to do with the 
PluginClasspathLoading. Anyway, this is something we should try to figure out 
for the 1.11 release so we can get the dependencies out of the core ivy.xml, 
but it's not a blocker now, since the functionality is pretty neat and we're 
just talking about a few more jar files (that don't conflict with anything) in 
the lib directory.

[~asitang], can you open a new issue to figure out how to get the dependencies 
into the plugin's ivy.xml and still make it work?



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-07-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610198#comment-14610198
 ] 

Chris A. Mattmann commented on NUTCH-2038:
--

Hey [~markus.jel...@openindex.io], yeah, I rolled back NUTCH-2052 (I added it 
by accident). It's out of there now!



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-07-01 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610439#comment-14610439
 ] 

Asitang Mishra commented on NUTCH-2038:
---

Sure...



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-07-01 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14610476#comment-14610476
 ] 

Asitang Mishra commented on NUTCH-2038:
---

NUTCH-2057



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-07-01 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609625#comment-14609625
 ] 

Hudson commented on NUTCH-2038:
---

SUCCESS: Integrated in Nutch-trunk #3186 (See 
[https://builds.apache.org/job/Nutch-trunk/3186/])
Fix for NUTCH-2038: add mattmann to template to make RSS links relevant. 
(mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688555)
* /nutch/trunk/conf/naivebayes-wordlist.txt.template
Add mattmann for unit test for NUTCH-2038 to pass. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688553)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/index-static/src/java/org/apache/nutch/indexer/staticfield/StaticFieldIndexer.java
* /nutch/trunk/src/plugin/index-static/src/test/org/apache/nutch/indexer/staticfield/TestStaticFieldIndexerTest.java
* /nutch/trunk/src/plugin/scoring-similarity/src/java/org/apache/nutch/scoring/similarity/cosine
Add mattmann for unit test for NUTCH-2038 to pass. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688552)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/conf/naivebayes-wordlist.txt.template
* /nutch/trunk/src/plugin/index-static/src/java/org/apache/nutch/indexer/staticfield/StaticFieldIndexer.java
* /nutch/trunk/src/plugin/index-static/src/test/org/apache/nutch/indexer/staticfield/TestStaticFieldIndexerTest.java




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-07-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609639#comment-14609639
 ] 

Markus Jelsma commented on NUTCH-2038:
--

I have tried to search the comments here, but can anyone explain why Lucene and 
Mahout are in the core ivy.xml when this is just a plugin?



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-07-01 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609645#comment-14609645
 ] 

Markus Jelsma commented on NUTCH-2038:
--

Also, you have committed NUTCH-2052 :P



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-07-01 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609618#comment-14609618
 ] 

Chris A. Mattmann commented on NUTCH-2038:
--

It's because I didn't include the updates to the *.template file for the 
wordlist! Passing now in r1688555.

{noformat}

BUILD SUCCESSFUL
Total time: 7 minutes 52 seconds
[mattmann@imagecat nutch1.11]$ 
{noformat}

That's on a fresh machine with a new Nutch trunk install. I have also scheduled:

https://builds.apache.org/job/Nutch-trunk/3186/

so it should go ahead and build fine there too.





[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-30 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609295#comment-14609295
 ] 

Asitang Mishra commented on NUTCH-2038:
---

I tried adding the jars to the main ivy.xml; it works fine there.



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-30 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609542#comment-14609542
 ] 

Asitang Mishra commented on NUTCH-2038:
---

woot!!1



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609539#comment-14609539
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/42




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-30 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609555#comment-14609555
 ] 

Hudson commented on NUTCH-2038:
---

FAILURE: Integrated in Nutch-trunk #3184 (See 
[https://builds.apache.org/job/Nutch-trunk/3184/])
Updates to make tests pass related to NUTCH-2038: Naive Bayes classifier based 
html Parse filter (for filtering outlinks) this closes #42. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688549)
* /nutch/trunk/conf/naivebayes-train.txt.template
* /nutch/trunk/conf/naivebayes-wordlist.txt.template
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/default.properties
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/plugin/parsefilter-naivebayes/ivy.xml




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-30 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609575#comment-14609575
 ] 

Chris A. Mattmann commented on NUTCH-2038:
--

So this passed locally for me. I'm wondering if that's because the file already existed?



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-30 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609538#comment-14609538
 ] 

Chris A. Mattmann commented on NUTCH-2038:
--

OK, I found a few more issues:

1. Adding conf/naivebayes-*.txt.template didn't fully fix it, because 
conf/nutch-default.xml needed updating for the wordlist property [done]
2. TestFeedParser in parse-tika failed because the default training and 
wordlist files didn't identify the expected 2 outlinks as relevant. I've 
updated conf/naivebayes-wordlist.txt to address this.
3. The files generated by Mahout should be put somewhere like 
crawl_dir/parsefilter-naivebayes, e.g., 
   - temp
   - vectors
   - labelindex
   - model
   - outseq
   - .labelindex.crc
 [to be done in a future patch; [~asitang], please file an issue for this]
4. Moved the dependencies out of the plugin and into the main Nutch ivy.xml 
[probably not the best, but it doesn't hurt too much for now and gets around 
the plugin issue; done]
5. Updated default.properties to declare parsefilter-naivebayes [done]

Going to commit the update now. All tests pass locally.

Thanks, everyone, for the review.
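
For item 3, the proposed layout could be resolved along these lines. This is a 
minimal sketch, assuming the artifact names listed above stay as they are; the 
class MahoutOutputLayoutSketch and its method are hypothetical, and nothing 
here reflects how the plugin currently chooses its output locations.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that maps each Mahout artifact to a per-crawl location. */
class MahoutOutputLayoutSketch {

    // Subdirectory name suggested in the comment above.
    static final String PLUGIN_DIR = "parsefilter-naivebayes";

    // Artifact names as listed in the comment above.
    static final List<String> ARTIFACTS = Arrays.asList(
        "temp", "vectors", "labelindex", "model", "outseq", ".labelindex.crc");

    /** Returns artifact name -> path under crawlDir/parsefilter-naivebayes. */
    static Map<String, Path> resolve(Path crawlDir) {
        Path base = crawlDir.resolve(PLUGIN_DIR);
        Map<String, Path> paths = new LinkedHashMap<>();
        for (String name : ARTIFACTS) {
            paths.put(name, base.resolve(name));
        }
        return paths;
    }

    public static void main(String[] args) {
        // Print the resolved location of every artifact for an example crawl dir.
        resolve(Paths.get("crawl_dir")).forEach(
            (name, path) -> System.out.println(name + " -> " + path));
    }
}
```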





[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-30 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14609282#comment-14609282
 ] 

Sebastian Nagel commented on NUTCH-2038:


Hi [~asitang], I was able to reproduce the exception. To train the classifier, a 
MapReduce job is launched:
* The job obviously does not have the classes of the plugin at hand. Each plugin 
uses its own class loader (see 
[[1|http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading]]). 
I don't know whether it's possible to make the plugin classes available to the 
training job.
* If the classifier is trained inside the parse step of a crawl, a job/task 
launches another job. That sounds awkward. Again, I don't know whether this 
will work at all in local and in distributed mode.

Sorry that I didn't notice this dependency on running a MapReduce job earlier. 
Unfortunately, Mahout does not provide a non-MapReduce version of Naive Bayes 
([[2|https://mahout.apache.org/users/basics/algorithms.html]]). Finding a 
solution needs some thought. If in doubt, the training step could be run 
separately beforehand.




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-29 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605661#comment-14605661
 ] 

Asitang Mishra commented on NUTCH-2038:
---

Yup, it didn't fail for me either. I'm going to list all the libraries in plugin.xml now.



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-29 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605643#comment-14605643
 ] 

Chris A. Mattmann commented on NUTCH-2038:
--

Ugh. On #2, I guess I missed declaring Mahout's commons-cli in the plugin's 
ivy.xml too. Re: #1... weird, that test didn't fail for me after I added 
Mahout's commons-cli to the ivy.xml. [~asitang], can you help me look at this?



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605729#comment-14605729
 ] 

Sebastian Nagel commented on NUTCH-2038:


Yep, this fixes problem #2.



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605716#comment-14605716
 ] 

Sebastian Nagel commented on NUTCH-2038:


(#1) The unit tests fail if the build and tests are run from a clean source tree:
* 2 files from pull request #39 are missing in the SVN commit. Ideally, they 
should be added as
{noformat}
conf/naivebayes-train.txt.template
conf/naivebayes-wordlist.txt.template
{noformat}
As templates, they are instantiated/copied only once and are not overwritten if 
a user changes the *.txt files in place.
* Anyway, it would be nice to catch all IO errors related to an incomplete 
configuration and show a helpful error message, e.g., "Failed to load 
naivebayes-train.txt configured in parsefilter.naivebayes.trainfile: ..."

(#2) The commons-cli jar is properly installed in 
{{build/plugins/parsefilter-naivebayes/}} and in the corresponding runtime 
folder or job file, respectively. It should be enough to add all indirect 
dependencies to plugin.xml.
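
The two suggestions under (#1) could look roughly like this in Java. It is a 
sketch under stated assumptions, not the plugin's code: the class 
TrainFileLoaderSketch and both method names are hypothetical, and the 
copy-once step is done by the build in Nutch rather than by Java code; it is 
shown here only to make the behaviour concrete.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Hypothetical sketch of copy-once templates and a descriptive load error. */
class TrainFileLoaderSketch {

    /** Copies the template only if the target is missing; returns true if it copied. */
    static boolean instantiateTemplate(Path template, Path target) throws IOException {
        if (Files.exists(target)) {
            return false; // an existing *.txt file (possibly user-edited) is left alone
        }
        Files.copy(template, target);
        return true;
    }

    /** Reads the configured file, rethrowing IO errors with file and property names. */
    static List<String> loadConfigured(Path file, String propertyName) throws IOException {
        try {
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new IOException("Failed to load " + file.getFileName()
                + " configured in " + propertyName + ": " + e, e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("naivebayes-sketch");
        Path template = dir.resolve("naivebayes-train.txt.template");
        Path target = dir.resolve("naivebayes-train.txt");
        Files.write(template, "placeholder line\n".getBytes(StandardCharsets.UTF_8));

        System.out.println(instantiateTemplate(template, target)); // first call copies
        System.out.println(instantiateTemplate(template, target)); // second call does not

        try {
            loadConfigured(dir.resolve("missing.txt"), "parsefilter.naivebayes.trainfile");
        } catch (IOException e) {
            System.out.println(e.getMessage()); // names the file and the property
        }
    }
}
```

Wrapping the original exception keeps the underlying cause available in the 
stack trace while the message points the user at the property to fix.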



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605681#comment-14605681
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/40

NUTCH-2038

added all the jars in plugin.xml

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/40.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #40


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T22:59:45Z

patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:00:40Z

patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:05:20Z

patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:07:44Z

patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:51:58Z

Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:53:22Z

Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:53:52Z

Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T00:03:43Z

Patch 5.0 for NUTCH-2038

commit 71be15df81222adc6b58b6308e1dac7db23b6386
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T04:21:38Z

Patch 5.1 for NUTCH-2038

commit a9465c06d59e7ed2bd13d07c128bcea574fc9d6c
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T14:27:02Z

Patch 5.2 for NUTCH-2038




 Naive Bayes classifier based html Parse filter (for filtering outlinks)
 ---

 Key: NUTCH-2038
 URL: https://issues.apache.org/jira/browse/NUTCH-2038
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, injector, parser
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
  Labels: memex, nutch
 Fix For: 1.11


 A html parse filter that will filter out the outlinks in two stages. 
 Classify the parse text and decide if the parent page is relevant. If 
 relevant then don't filter the outlinks. If irrelevant then go thru each 
 outlink and see if the url contains any of the important words from a list. 
 If it does then let it pass.
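
The two stages described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the plugin's actual code: the class name TwoStageOutlinkFilterSketch, the filterOutlinks helper and the Predicate standing in for the trained Naive Bayes model are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

/** Illustrative sketch only; all names here are hypothetical. */
final class TwoStageOutlinkFilterSketch {

  /**
   * Stage 1: if the parent page's text is classified as relevant, keep every outlink.
   * Stage 2: otherwise keep only the outlinks whose URL contains a word from the list.
   */
  static List<String> filterOutlinks(String parseText, List<String> outlinks,
      Predicate<String> isRelevant, List<String> wordlist) {
    if (isRelevant.test(parseText)) {
      return new ArrayList<>(outlinks);
    }
    List<String> kept = new ArrayList<>();
    for (String url : outlinks) {
      if (containsWord(url, wordlist)) {
        kept.add(url);
      }
    }
    return kept;
  }

  /** Plain substring match of each listed word against the URL. */
  static boolean containsWord(String url, List<String> wordlist) {
    for (String word : wordlist) {
      if (url.contains(word)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Stand-in classifier (assumption): a keyword test instead of a trained model.
    Predicate<String> classifier = text -> text.contains("crawler");
    List<String> links = Arrays.asList(
        "http://example.org/nutch/wiki", "http://example.org/ads");
    List<String> words = Arrays.asList("nutch");
    System.out.println(filterOutlinks("a page about a web crawler", links, classifier, words));
    System.out.println(filterOutlinks("an unrelated page", links, classifier, words));
  }
}
```

Passing the classifier in as a Predicate keeps the sketch independent of any particular model; in the plugin that role is played by the Naive Bayes classifier trained from the configured training file.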



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606348#comment-14606348
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/41




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-29 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606592#comment-14606592
 ] 

Asitang Mishra commented on NUTCH-2038:
---

Hi [~wastl-nagel],

I am facing the following issue when running in local mode (please test the latest 
pull request for this). I also faced it with pull request #40. Please test and see if 
you are facing it too.
I have added all the dependencies, and I don't understand why it is still giving 
a ClassNotFoundException:

java.lang.Exception: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:857)
at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:718)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper
at java.net.URLClassLoader$1.run(URLClassLoader.java:372)
at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:360)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:340)
at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:810)
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:855)
... 9 more
2015-06-29 15:45:05,038 ERROR naivebayes.NaiveBayesParseFilter - Error occured while training:: java.lang.IllegalStateException: Job failed!
at org.apache.mahout.vectorizer.DocumentProcessor.tokenizeDocuments(DocumentProcessor.java:95)
at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.run(SparseVectorsFromSequenceFiles.java:257)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles.main(SparseVectorsFromSequenceFiles.java:56)
at org.apache.nutch.parsefilter.naivebayes.NaiveBayesClassifier.createModel(NaiveBayesClassifier.java:105)
at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:90)
at org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:160)
at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
at org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441)
at org.apache.nutch.parse.HtmlParseFilters.init(HtmlParseFilters.java:35)
at org.apache.nutch.parse.html.HtmlParser.setConf(HtmlParser.java:343)
at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
at org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136)
at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:78)
at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:104)
at org.apache.nutch.parse.ParseSegment.map(ParseSegment.java:46)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:430)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:366)
at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
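
In the trace above, the failing lookup goes through Configuration.getClassByName and ends in sun.misc.Launcher$AppClassLoader, i.e. the JVM's application class loader. One possible reading is that the Mahout jar is visible to the class loader that loaded the plugin but not to the loader the nested job uses. A small check like the one below might help confirm which loaders can see the class. It is a hypothetical diagnostic helper (ClassVisibilityCheck and isVisible are made-up names), not part of Nutch, Hadoop or Mahout, and it relies only on standard JDK calls.

```java
/** Diagnostic sketch; a hypothetical helper, not part of Nutch or Hadoop. */
final class ClassVisibilityCheck {

  /** Returns true if the named class can be resolved through the given loader. */
  static boolean isVisible(String className, ClassLoader loader) {
    try {
      // initialize=false: only resolve the class, do not run static initializers.
      Class.forName(className, false, loader);
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    }
  }

  public static void main(String[] args) {
    String name = args.length > 0 ? args[0]
        : "org.apache.mahout.vectorizer.document.SequenceFileTokenizerMapper";
    ClassLoader system = ClassLoader.getSystemClassLoader();
    ClassLoader context = Thread.currentThread().getContextClassLoader();
    System.out.println("system loader sees class:  " + isVisible(name, system));
    System.out.println("context loader sees class: " + isVisible(name, context));
  }
}
```

Calling isVisible from inside the plugin with the plugin class's own loader (for example NaiveBayesParseFilter.class.getClassLoader()) and comparing the answer with the system loader would show whether the two disagree.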


 

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606359#comment-14606359
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/42

NUTCH-2038

minor changes and suggestions by Sebastian.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/42.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #42


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T22:59:45Z

patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:00:40Z

patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:05:20Z

patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:07:44Z

patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:51:58Z

Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:53:22Z

Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:53:52Z

Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T00:03:43Z

Patch 5.0 for NUTCH-2038

commit 71be15df81222adc6b58b6308e1dac7db23b6386
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T04:21:38Z

Patch 5.1 for NUTCH-2038

commit a9465c06d59e7ed2bd13d07c128bcea574fc9d6c
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T14:27:02Z

Patch 5.2 for NUTCH-2038

commit 8f45e634c942df66ea9c1ee775bb216d35fabb87
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T19:23:39Z

patch 6.0 for NUTCH-2038

commit 97278c5a09f5d4391473185d2268c7b26f151120
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T19:24:34Z

patch 6.0 for NUTCH-2038

commit 9b876bc8cbad902b094d696e3df751d9f163e4b3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T19:25:06Z

patch 6.0 for NUTCH-2038

commit 866486e6be337e8a1e0e5209642649a1834278d3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T19:35:24Z

patch 6.1 for NUTCH-2038

commit dd159175822a476cc5889da71a19272cf733e011
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T20:39:55Z

patch 6.2 for NUTCH-2038





[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606228#comment-14606228
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/40




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-29 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14606234#comment-14606234
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/41

NUTCH-2038

--added specific IOException messages
--added files: 
conf/naivebayes-train.txt.template
conf/naivebayes-wordlist.txt.template

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/41.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #41


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T22:59:45Z

patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:00:40Z

patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:05:20Z

patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:07:44Z

patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:51:58Z

Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:53:22Z

Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:53:52Z

Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T00:03:43Z

Patch 5.0 for NUTCH-2038

commit 71be15df81222adc6b58b6308e1dac7db23b6386
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T04:21:38Z

Patch 5.1 for NUTCH-2038

commit a9465c06d59e7ed2bd13d07c128bcea574fc9d6c
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T14:27:02Z

Patch 5.2 for NUTCH-2038

commit 8f45e634c942df66ea9c1ee775bb216d35fabb87
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T19:23:39Z

patch 6.0 for NUTCH-2038

commit 97278c5a09f5d4391473185d2268c7b26f151120
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T19:24:34Z

patch 6.0 for NUTCH-2038

commit 9b876bc8cbad902b094d696e3df751d9f163e4b3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T19:25:06Z

patch 6.0 for NUTCH-2038

commit 866486e6be337e8a1e0e5209642649a1834278d3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T19:35:24Z

patch 6.1 for NUTCH-2038





[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604998#comment-14604998
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/37




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14604999#comment-14604999
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/38

NUTCH-2038

Made new changes suggested and corrections by Sebastian.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/38.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #38


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T22:59:45Z

patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:00:40Z

patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:05:20Z

patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:07:44Z

patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:51:58Z

Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:53:22Z

Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:53:52Z

Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T00:03:43Z

Patch 5.0 for NUTCH-2038






[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605088#comment-14605088
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/38#discussion_r33432911
  
--- Diff: 
src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java
 ---
@@ -0,0 +1,204 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of parsefilter.naivebayes.trainfile) and if found irrelevent it
+ * gives the link a second chance if it contains any of the words from the list
+ * given in parsefilter.naivebayes.wordlist. CAUTION: Set the parser.timeout to
+ * -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(NaiveBayesParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "parsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "parsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public boolean filterParse(String text) {
+
+try {
+  return classify(text);
+} catch (IOException e) {
+  // TODO Auto-generated catch block
+  LOG.error("Error occured while classifying:: " + text + " ::"
+  + StringUtils.stringifyException(e));
+}
+
+return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+// if classified as relevent 1 then return true
+if (NaiveBayesClassifier.classify(text).equals("1"))
+  return true;
+return false;
+  }
+
+  public void train() throws Exception {
+// check if the model file exists, if it does then don't train
+if (!FileSystem.get(conf).exists(new Path("model"))) {
+  LOG.info("Training the Naive Bayes Model");
+  NaiveBayesClassifier.createModel(inputFilePath);
+} else {
+  LOG.info("Model file already exists. Skipping training.");
+}
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+for (String word : wordlist) {
+  if (url.contains(word)) {
+return true;
+  }
+}
+
+return false;
+  }
+
+  public void setConf(Configuration conf) {
+this.conf = conf;
+inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+  

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605087#comment-14605087
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/38#discussion_r33432889
  
--- Diff: 
src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java
 ---
@@ -0,0 +1,204 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of parsefilter.naivebayes.trainfile) and if found irrelevent it
+ * gives the link a second chance if it contains any of the words from the list
+ * given in parsefilter.naivebayes.wordlist. CAUTION: Set the parser.timeout to
+ * -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(NaiveBayesParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "parsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "parsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public boolean filterParse(String text) {
+
+try {
+  return classify(text);
+} catch (IOException e) {
+  // TODO Auto-generated catch block
--- End diff --

I asked for this to be removed.




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-28 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14605169#comment-14605169
 ] 

Hudson commented on NUTCH-2038:
---

FAILURE: Integrated in Nutch-trunk #3181 (See 
[https://builds.apache.org/job/Nutch-trunk/3181/])
fix for NUTCH-2038: Naive Bayes classifier based html Parse filter (for 
filtering outlinks) contributed by Asitang Mishra asit...@gmail.com this 
closes #39 (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688085)
* /nutch/trunk/CHANGES.txt
fix for NUTCH-2038: Naive Bayes classifier based html Parse filter (for 
filtering outlinks) contributed by Asitang Mishra asit...@gmail.com this 
closes #39 (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1688084)
* /nutch/trunk/.gitignore
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/ivy/ivy.xml
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/parsefilter-naivebayes
* /nutch/trunk/src/plugin/parsefilter-naivebayes/build.xml
* /nutch/trunk/src/plugin/parsefilter-naivebayes/ivy.xml
* /nutch/trunk/src/plugin/parsefilter-naivebayes/plugin.xml
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesClassifier.java
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java
* /nutch/trunk/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/package-info.java




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605099#comment-14605099
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/39

NUTCH-2038

Removed the TODO comments

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/39.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #39


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T22:59:45Z

patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:00:40Z

patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:05:20Z

patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:07:44Z

patch 4.1 for NUTCH-2038

commit 5e907b1109c8e623bfcdb25b4b467dd53fbec9f3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:51:58Z

Patch 5.0 for NUTCH-2038

commit b984cdfac2d30ef38b1aebbc0330ba7eee1e12bf
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:53:22Z

Patch 5.0 for NUTCH-2038

commit ecbd4c27ae71b8c04e011c6b7106cc1fb324e04a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-28T23:53:52Z

Patch 5.0 for NUTCH-2038

commit aba64fc941ed7616153d19410dbe9b9a0f8ef387
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T00:03:43Z

Patch 5.0 for NUTCH-2038

commit 71be15df81222adc6b58b6308e1dac7db23b6386
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-29T04:21:38Z

Patch 5.1 for NUTCH-2038






[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605093#comment-14605093
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asitang commented on a diff in the pull request:

https://github.com/apache/nutch/pull/38#discussion_r33433090
  
--- Diff: 
src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java
 ---
@@ -0,0 +1,204 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parsefilter.naivebayes;
+
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+import org.w3c.dom.DocumentFragment;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.apache.hadoop.util.StringUtils;
+import org.apache.nutch.parse.HTMLMetaTags;
+import org.apache.nutch.parse.HtmlParseFilter;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseData;
+import org.apache.nutch.parse.ParseResult;
+import org.apache.nutch.parse.ParseStatus;
+import org.apache.nutch.parse.ParseText;
+import org.apache.nutch.protocol.Content;
+
+import java.io.Reader;
+import java.io.BufferedReader;
+import java.io.IOException;
+import java.util.ArrayList;
+
+/**
+ * Html Parse filter that classifies the outlinks from the parseresult as
+ * relevant or irrelevant based on the parseText's relevancy (using a training
+ * file where you can give positive and negative example texts see the
+ * description of parsefilter.naivebayes.trainfile) and if found irrelevant it
+ * gives the link a second chance if it contains any of the words from the list
+ * given in parsefilter.naivebayes.wordlist. CAUTION: Set the parser.timeout to
+ * -1 or a bigger value than 30, when using this classifier.
+ */
+public class NaiveBayesParseFilter implements HtmlParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+  .getLogger(NaiveBayesParseFilter.class);
+
+  public static final String TRAINFILE_MODELFILTER = "parsefilter.naivebayes.trainfile";
+  public static final String DICTFILE_MODELFILTER = "parsefilter.naivebayes.wordlist";
+
+  private Configuration conf;
+  private String inputFilePath;
+  private String dictionaryFile;
+  private ArrayList<String> wordlist = new ArrayList<String>();
+
+  public boolean filterParse(String text) {
+
+try {
+  return classify(text);
+} catch (IOException e) {
+  // TODO Auto-generated catch block
+  LOG.error("Error occured while classifying:: " + text + " ::"
+  + StringUtils.stringifyException(e));
+}
+
+return false;
+  }
+
+  public boolean filterUrl(String url) {
+
+return containsWord(url, wordlist);
+
+  }
+
+  public boolean classify(String text) throws IOException {
+
+// if classified as relevant "1" then return true
+if (NaiveBayesClassifier.classify(text).equals("1"))
+  return true;
+return false;
+  }
+
+  public void train() throws Exception {
+// check if the model file exists, if it does then don't train
+if (!FileSystem.get(conf).exists(new Path("model"))) {
+  LOG.info("Training the Naive Bayes Model");
+  NaiveBayesClassifier.createModel(inputFilePath);
+} else {
+  LOG.info("Model file already exists. Skipping training.");
+}
+  }
+
+  public boolean containsWord(String url, ArrayList<String> wordlist) {
+for (String word : wordlist) {
+  if (url.contains(word)) {
+return true;
+  }
+}
+
+return false;
+  }
+
+  public void setConf(Configuration conf) {
+this.conf = conf;
+inputFilePath = conf.get(TRAINFILE_MODELFILTER);
+

[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605098#comment-14605098
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/38




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605132#comment-14605132
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/39




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-28 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605120#comment-14605120
 ] 

Chris A. Mattmann commented on NUTCH-2038:
--

Tests fail in TestParserFactory:

{noformat}
org/apache/commons/cli2/Option
java.lang.NoClassDefFoundError: org/apache/commons/cli2/Option
at 
org.apache.nutch.parsefilter.naivebayes.NaiveBayesClassifier.createModel(NaiveBayesClassifier.java:105)
at 
org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:93)
at 
org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:148)
at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
at 
org.apache.nutch.plugin.PluginRepository.getOrderedPlugins(PluginRepository.java:441)
at 
org.apache.nutch.parse.HtmlParseFilters.<init>(HtmlParseFilters.java:34)
at org.apache.nutch.parse.tika.TikaParser.setConf(TikaParser.java:244)
at 
org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:163)
at 
org.apache.nutch.parse.ParserFactory.getParsers(ParserFactory.java:136)
at 
org.apache.nutch.parse.TestParserFactory.testGetParsers(TestParserFactory.java:63)
{noformat}

I'll fix it.



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602553#comment-14602553
 ] 

Sebastian Nagel commented on NUTCH-2038:


Great [~asitangm]! I tried to run it via parsechecker and also within a 
small crawl.
* there is still one {{e.printStackTrace();}} :)
* if the plugin is activated in plugin.includes but not configured:
{noformat}
2015-06-26 09:33:24,174 ERROR naivebayes.NaiveBayesParseFilter - ParseFilter: 
NaiveBayes: trainfile or wordlist not set in the 
parsefilte.naivebayes.trainfile or parsefilte.naivebayes.wordlist
2015-06-26 09:33:24,175 WARN  parse.ParseSegment - Error parsing: 
file:/home/wastl/work/websearch/crawler/nutch/src/plugin/parse-exorbyte/sample/subdocuments1-html5.html:
 java.lang.IllegalArgumentException: ParseFilter: NaiveBayes: trainfile or 
wordlist not set in the parsefilte.naivebayes.trainfile or 
parsefilte.naivebayes.wordlist
at 
org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:120)
{noformat}
A plugin advertised in the description of plugin.includes should ideally work 
out-of-the-box. You could add train/word file templates to conf/ containing a 
few trivial ham/spam examples. They are then instantiated and installed into 
runtime/ and users could just modify them.
* there should be a clear error message if a configured file fails to load 
(e.g., "Failed to load naivebayes-train.txt configured in 
parsefilter.naivebayes.trainfile: ...") instead of
{noformat}
Exception in thread "main" java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.BufferedReader.<init>(BufferedReader.java:94)
at java.io.BufferedReader.<init>(BufferedReader.java:109)
at 
org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:129)
{noformat}
* finally, the JobRunner crashed with:
{noformat}
2015-06-26 09:48:50,762 INFO  naivebayes.NaiveBayesParseFilter - Training the 
Naive Bayes Model
2015-06-26 09:48:50,764 WARN  mapred.LocalJobRunner - job_local1978281032_0001
java.lang.Exception: java.lang.NoClassDefFoundError: 
org/apache/lucene/analysis/Analyzer
at 
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:354)
Caused by: java.lang.NoClassDefFoundError: org/apache/lucene/analysis/Analyzer
at 
org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.train(NaiveBayesParseFilter.java:94)
at 
org.apache.nutch.parsefilter.naivebayes.NaiveBayesParseFilter.setConf(NaiveBayesParseFilter.java:142)
{noformat}
 That's probably because the dependencies are not listed in the 
plugin.xml.
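
The request above for a clear error message when a configured file fails to 
load could look roughly like the sketch below. It is only an illustration under 
stated assumptions: the class ConfigFileLoadSketch and the method readWordlist 
are hypothetical names, the one-word-per-line file format is assumed, and the 
null check reflects a guess that the NullPointerException in the trace came 
from wrapping a null Reader in a BufferedReader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: fail with a readable message instead of an NPE. */
final class ConfigFileLoadSketch {

  /**
   * Reads one word per line (assumed format) from the given reader.
   * A null reader is treated as "file could not be found".
   */
  static List<String> readWordlist(Reader reader, String fileName, String property) {
    if (reader == null) {
      throw new IllegalArgumentException("Failed to load " + fileName
          + " configured in " + property + ": file not found");
    }
    List<String> words = new ArrayList<>();
    try (BufferedReader br = new BufferedReader(reader)) {
      String line;
      while ((line = br.readLine()) != null) {
        line = line.trim();
        if (!line.isEmpty()) {
          words.add(line);
        }
      }
    } catch (IOException e) {
      throw new IllegalArgumentException("Failed to load " + fileName
          + " configured in " + property + ": " + e.getMessage(), e);
    }
    return words;
  }
}
```

Naming both the file and the property in the message tells the user which 
setting to fix without reading the stack trace.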



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-26 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603977#comment-14603977
 ] 

Asitang Mishra commented on NUTCH-2038:
---

Oh great, will fix them all :)



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600788#comment-14600788
 ] 

Sebastian Nagel commented on NUTCH-2038:


Great, thanks! Ideally the model is loaded from setConf() which is executed 
once per job/task.
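
A minimal sketch of that idea, assuming the model can be produced by some 
loader function: the holder below loads the model the first time init() is 
called (for example from setConf()) and hands out the cached instance 
afterwards. ModelCacheSketch, init(), get() and loadCount() are hypothetical 
names for illustration and are not part of the patch.

```java
import java.util.function.Supplier;

/** Hypothetical sketch: load a model once and keep it in memory. */
final class ModelCacheSketch<M> {

  private final Supplier<M> loader; // stand-in for reading or training the model
  private M model;
  private int loads;

  ModelCacheSketch(Supplier<M> loader) {
    this.loader = loader;
  }

  /** Intended to be called from setConf(); loads only on the first call. */
  synchronized void init() {
    if (model == null) {
      model = loader.get();
      loads++;
    }
  }

  /** Returns the cached model; classify() would use this instead of re-reading a file. */
  synchronized M get() {
    if (model == null) {
      throw new IllegalStateException("init() has not been called");
    }
    return model;
  }

  /** Number of times the loader actually ran (useful for checking the caching). */
  synchronized int loadCount() {
    return loads;
  }
}
```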



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-25 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14601499#comment-14601499
 ] 

Chris A. Mattmann commented on NUTCH-2038:
--

OK so it looks like the latest patch is committable. Asitang is addressing Seb's 
comments. I will go ahead and get this committed later today after I see the 
final PR/patch. Great work.



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602110#comment-14602110
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

GitHub user asitang opened a pull request:

https://github.com/apache/nutch/pull/37

NUTCH-2038

Made changes suggested by Sebastian Nagel.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/asitang/nutch NUTCH-2038

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/37.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #37


commit b0ce4a157dbd0bfd8ea368f3fa230a90c7117ae2
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:11:42Z

patch 1.0 for NUTCH-2038

commit e243cc5e626106a4cd8dfca8d9c2ec93e9648560
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:14:37Z

patch 1.0 for NUTCH-2038

commit 711f44d8d4af51538ff1764145ac743445b6f43b
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-17T16:35:28Z

patch 1.0 for NUTCH-2038

commit e0e924e15c247d3fa3dd92f387fe53ba7effd78a
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-18T15:09:30Z

final commir for pattch 1.0

commit cca768bc1c790a976594136433485fe899465cb8
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-19T20:13:34Z

Patch 2.0 for NUTCH-2038

commit 0e80bf471b7d40965cf3bdad908252f5ce577d85
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:45:50Z

commit for 3.0 patch of NUTCH-2038

commit 63efcfecd2eda339c3c55a6236cb88c7a08698bc
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:46:46Z

commit for 3.0 patch of NUTCH-2038

commit 3a7bf466c76e8cffef96063101a39a77c328d657
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:55:22Z

commit for 3.1 patch of NUTCH-2038

commit ae89456e9f4078111653273fe0ac52c26c568c36
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T15:58:12Z

commit for 3.2 patch of NUTCH-2038

commit ae639ec40263fafbd6c0273c619d425ee482f7f0
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-24T17:31:09Z

commit for 3.3 patch of NUTCH-2038

commit 5ba14790c1367deeb54d4d61f87be3d602cecedf
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T22:59:45Z

patch 4.0 for NUTCH-2038

commit 4b5597a5fac0d3d94a38aace9b8a386d956da4e3
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:00:40Z

patch 4.0 for NUTCH-2038

commit 9ebcae33284d325f86bdbcfa18ef2c9a5744e67d
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:05:20Z

patch 4.1 for NUTCH-2038

commit 830f05bfe77abf79b2877c2a9c388fa24b3df526
Author: Asitang Mishra asit...@gmail.com
Date:   2015-06-25T23:07:44Z

patch 4.1 for NUTCH-2038






[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14602108#comment-14602108
 ] 

ASF GitHub Bot commented on NUTCH-2038:
---

Github user asitang closed the pull request at:

https://github.com/apache/nutch/pull/36




[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600269#comment-14600269
 ] 

Sebastian Nagel commented on NUTCH-2038:


Hi [~asitang], the latest pull request #36 looks good.
- maybe rename the plugin to parsefilter-naivebayes for simplicity and in 
advance of NUTCH-1482
- is this statement still true?
bq. CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using 
this classifier.
- afaics, the way the model is generated, stored and loaded needs a review:
-* it should be read/generated once and then cached in memory,
-* writing the model to disk is likely to become painful in distributed mode 
with concurrent tasks.
- cosmetics:
-* exceptions should be logged properly via 
LOG.error(StringUtils.stringifyException(e)) so that they do not get lost 
somewhere in stdout/stderr, as they do with e.printStackTrace()
-* code formatting, see 
[[1|http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer#Step_Three:_Using_the_JIRA_and_Developing]]
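
To illustrate the logging point above: the idea is to turn the stack trace 
into a string so it can be passed to the logger rather than printed to stderr. 
The sketch below uses only the JDK and is a stand-in written for this example 
(StackTraceStringSketch and stringify are hypothetical names); in Nutch code 
one would use Hadoop's StringUtils.stringifyException(e) with the slf4j logger 
instead.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

/** Hypothetical JDK-only stand-in for turning a stack trace into a String. */
final class StackTraceStringSketch {

  /** Captures what e.printStackTrace() would print, as a String for the logger. */
  static String stringify(Throwable e) {
    StringWriter sw = new StringWriter();
    PrintWriter pw = new PrintWriter(sw);
    e.printStackTrace(pw);
    pw.flush();
    return sw.toString();
  }
}
```

The resulting string can be handed to LOG.error(...), so the trace ends up in 
the job's log file together with the other messages.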



[jira] [Commented] (NUTCH-2038) Naive Bayes classifier based html Parse filter (for filtering outlinks)

2015-06-24 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14600419#comment-14600419
 ] 

Asitang Mishra commented on NUTCH-2038:
---


> maybe rename the plugin to parsefilter-naivebayes for simplicity and in 
> advance of NUTCH-1482

Will do that.

> is this statement still true?
> CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when 
> using this classifier.

The first call to the parse filter takes a bit more time because the training 
is done and the model is created, so the timeout should be a little higher. 
Later calls do not take much time.

> afaics, the way the model is generated, stored and loaded needs a review:
> it should be read/generated once and then cached in memory,
> writing the model to disk is likely to become painful in distributed 
> mode with concurrent tasks.

The model is created during the parsing of the first fetched page of the very 
first parse job. After that it checks whether the model file is already present.
The model file is currently read each time the classify() function is called; 
I will change that and keep the model in memory for the whole of a single parse 
job.

> cosmetics:
> exceptions are properly logged via 
> LOG.error(StringUtils.stringifyException(e)) and do not get lost somewhere in 
> stdout/stderr as of e.printStackTrace()
> code formatting, see [1]

Will do that.

