Re: nutch and mahout integration

2012-07-02 Thread Mathijs Homminga
We wrote a custom Nutch parse plugin that uses a Mahout classifier to classify 
docs.

Mathijs Homminga

On Jul 1, 2012, at 21:02, Alexander Aristov alexander.aris...@gmail.com wrote:

 People,
 
 can you give me some advice? 
 
 I want to integrate nutch and mahout to classify crawled pages. 
 
 1st question: Has someone tried this and are there any libraries available?
 
 next: What is better/easier? Improve nutch and inject mahout classifier into 
 the project OR improve mahout to add an ability to read and write nutch files?
 
 Best Regards
 Alexander Aristov


Re: nutch and mahout integration

2012-07-02 Thread Julien Nioche
Alexander,

can you give me some advice?

 I want to integrate nutch and mahout to classify crawled pages.

 1st question: Has someone tried this and are there any libraries available?


https://github.com/DigitalPebble/behemoth could be used to do Nutch ->
Behemoth -> Mahout. The only problem is that there is no standard format
for the Mahout classifiers, so you would need to write a bit of code for it.
There is also a SOLR plugin in Behemoth.

Alternatively you can use our Text Classification API (
https://github.com/DigitalPebble/TextClassification) within a Nutch
indexing filter.



 next: What is better/easier? Improve nutch and inject mahout classifier
 into the project OR improve mahout to add an ability to read and write
 nutch files?


Depends on what you need to do with the data after classification.
Behemoth already does the conversion from Nutch to Mahout but again the
problem is the lack of standard on the Mahout side.

HTH

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


[jira] [Updated] (NUTCH-1087) Deprecate crawl command and replace with example script

2012-07-02 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1087:
-

Attachment: crawl

WORK IN PROGRESS
Need to add more comments and include the injection, linkdb and SOLR steps.
The rest of the script should be fine and should provide a good basis.


 Deprecate crawl command and replace with example script
 ---

 Key: NUTCH-1087
 URL: https://issues.apache.org/jira/browse/NUTCH-1087
 Project: Nutch
  Issue Type: Task
Affects Versions: 1.4
Reporter: Markus Jelsma
Priority: Minor
 Fix For: 1.6

 Attachments: crawl


 * remove the crawl command
 * add basic crawl shell script
 See thread:
 http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




Re: Add me to the Mailing list

2012-07-02 Thread Lewis John Mcgibbney
http://nutch.apache.org/mailing_lists.html#Developers

On Sun, Jul 1, 2012 at 3:48 PM, michael F mich...@bionic8.com wrote:




-- 
Lewis


[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-07-02 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405038#comment-13405038
 ] 

Lewis John McGibbney commented on NUTCH-1415:
-

Hi Sebastian. I will be pushing the 1.5.1 RC today. I'll test this and commit 
today if all is good. Thanks for this. Lewis

 release packages to contain top level folder apache-nutch-x.x
 -

 Key: NUTCH-1415
 URL: https://issues.apache.org/jira/browse/NUTCH-1415
 Project: Nutch
  Issue Type: Bug
Affects Versions: nutchgora, 1.6, 1.5.1
Reporter: Sebastian Nagel
Priority: Minor
 Attachments: NUTCH-1415.patch


 The release packages should contain a top level folder named apache-nutch-x.x 
 (x replaced by major and minor version) as in previous releases. Unpacking 
 the packages from the command line via tar xvfz package.tar.gz or unzip 
 package.zip should place all files in that folder. Cf. discussions on mailing 
 lists:
 * 
 http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
 * 
 http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E
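 The expectation described above (everything unpacks into one apache-nutch-x.x
 folder) can be checked mechanically before a release candidate goes out. What
 follows is only a minimal sketch, not part of the Nutch build: the helper
 names, the in-memory demo archive and the folder name "apache-nutch-1.5.1"
 are all assumptions made for illustration.

```python
import io
import tarfile


def top_level_entries(archive):
    """Return the set of first path components of all members in a tar archive."""
    names = set()
    for member in archive.getmembers():
        # Treat "./foo/bar" and "foo/bar" alike: the first real component is "foo".
        parts = [p for p in member.name.split("/") if p not in ("", ".")]
        if parts:
            names.add(parts[0])
    return names


def has_single_top_level_folder(archive, expected):
    """True if every member of the archive lives under the expected folder."""
    return top_level_entries(archive) == {expected}


def build_demo_archive(paths):
    """Build an in-memory .tar.gz holding empty files at the given paths (demo only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as tar:
        for path in paths:
            info = tarfile.TarInfo(name=path)
            info.size = 0
            tar.addfile(info, io.BytesIO(b""))
    buffer.seek(0)
    return tarfile.open(fileobj=buffer, mode="r:gz")


if __name__ == "__main__":
    # Hypothetical layouts: one with the top-level folder, one without.
    good = build_demo_archive(["apache-nutch-1.5.1/bin/nutch"])
    bad = build_demo_archive(["bin/nutch", "conf/nutch-site.xml"])
    print(has_single_top_level_folder(good, "apache-nutch-1.5.1"))
    print(sorted(top_level_entries(bad)))
```

 For a real package, tarfile.open("package.tar.gz") would replace the demo
 builder; a zip package would need the same check written against zipfile.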





Re: Nutch Author, Publication, and Religion Detection

2012-07-02 Thread Lewis John Mcgibbney
OK so please let us know how you get on.

Although you seem to have a clear idea about how you're going to
progress with the issue, I would seriously consider taking on board
Julien's comments and grabbing the code that he's made available for
similar tasks.

All the best

Lewis

On Fri, Jun 29, 2012 at 7:19 PM, JAB george.garn...@baesystems.com wrote:
 Hi Lewis;

 I'm looking at creating a Nutch plugin to determine if a document is an article
 on religion, and which religion it's primarily talking about. Then, adding an
 annotation called 'religion' to the document with the primary category of
 the religion. Examples: Atheism, Buddhism, Christian, Hindu, Jewish,
 Muslim, or Unknown (if it can't be determined). No annotation will be added
 if it's not an article on religion. Next, another annotation for the
 sub-category of the religion. For example, under Christian would be Catholic
 or Protestant. Then possibly a third annotation for the denomination.
 Examples of denomination: 'Baptist Bible Churches' or 'Christian Methodist
 Episcopal Church' (I have a list of 147 denominations). I'm not familiar with
 religious breakdowns so I don't know if this is the appropriate way to
 categorize them.

 Design:

 I created a Java class on religion that extends the IndexingFilter class. I next
 determine if it's an article on religion. I do so by counting the number of
 occurrences of certain key words in the document. For example, if 'God' appears
 more than 10 times, it's an article on religion. If it mentions 'Christian'
 more than a certain number of times and more often than other religions, the
 sub-category would be 'Christian'. The first match in the denomination search
 would be assumed to be the denomination. I'm also using a
 language-detection plugin
 (http://developer.cybozu.co.jp/oss/2010/10/language-detect.html) to
 determine the language of the document so I can search for words in the
 document's native language. I don't know if this is the best approach to
 solving this issue.
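
 The counting rule in the design above can be sketched independently of Nutch.
 This is plain Python rather than an IndexingFilter, and the keyword lists, the
 category threshold and the function name are assumptions for illustration;
 only the "more than 10 times" marker rule comes from the post.

```python
import re
from collections import Counter

# Hypothetical keyword lists; the post does not give the actual ones.
RELIGION_MARKERS = {"god"}
CATEGORY_KEYWORDS = {
    "Christian": {"christian", "christians", "christianity"},
    "Buddhism": {"buddhism", "buddhist", "buddhists"},
    "Hindu": {"hindu", "hindus", "hinduism"},
}
MARKER_THRESHOLD = 10    # from the post: 'God' appears more than 10 times
CATEGORY_THRESHOLD = 3   # assumed value for "a certain number of times"


def classify_religion(text):
    """Return None for non-religious text, else a category name or "Unknown"."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    marker_hits = sum(counts[word] for word in RELIGION_MARKERS)
    if marker_hits <= MARKER_THRESHOLD:
        return None  # not an article on religion, so no annotation is added
    scores = {
        category: sum(counts[word] for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    best, best_score = max(scores.items(), key=lambda item: item[1])
    # The winner must clear the threshold and be mentioned more than any other.
    runner_up = max((s for c, s in scores.items() if c != best), default=0)
    if best_score > CATEGORY_THRESHOLD and best_score > runner_up:
        return best
    return "Unknown"


if __name__ == "__main__":
    sample = "God " * 11 + "Christian " * 5 + "Hindu " * 2
    print(classify_religion(sample))
```

 Raw counts favour long documents, so a relative frequency (hits per thousand
 words) may be a fairer threshold; that is a design option, not something the
 post specifies.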




 --
 View this message in context: 
 http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662p3992130.html
 Sent from the Nutch - Dev mailing list archive at Nabble.com.



-- 
Lewis


[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x

2012-07-02 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405163#comment-13405163
 ] 

Lewis John McGibbney commented on NUTCH-1415:
-

Hi Sebastian. Having committed this to the 1.5.1 branch and subsequent RC#2 
tag, the tar-src and zip-src artifacts seem to be fine; however, tar-bin and 
zip-bin are not and still fail to produce the apache-nutch-x.x top level folder 
within the generated artifacts. I wonder if you could check this out for me at 
your earliest convenience, as we are very close to generating a good 1.5.1 
RC#2 when this is complete. Thanks in advance

Lewis 





Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-02 Thread Lewis John Mcgibbney
Anyone else for this RC?

I've been slightly distracted with a number of things recently and am only
just getting round to following this one up, so apologies about that.

Best

Lewis

On Wed, Jun 27, 2012 at 10:23 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote:
 +1 Crawling with HBaseStore works from injecting to indexing.

 Great work Lewis.

 On Mon, Jun 25, 2012 at 6:32 PM, Lewis John Mcgibbney
 lewis.mcgibb...@gmail.com wrote:

 Hi Everyone,

 A candidate for the Apache Nutch 2.0 RC3 is available at:

 http://people.apache.org/~lewismc/apache-nutch-2.0rc3

 The release candidate is a src.zip and src.tar.gz ONLY
 archive of the sources in:

 http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc3

 We release Nutch 2.0 in this fashion due to the inclusion of
 Apache Gora and the likelihood that users will regularly recompile
 the code to suit dynamic requirements.

 Further, a staged Maven repository of the 2.0 jar, sources.jar and
 javadoc.jar is available here:

 https://repository.apache.org/content/repositories/orgapachenutch-275

 Please vote on releasing this package as Apache Nutch 2.0.
 The vote is open for the next 72 hours and passes if a majority of at
 least three +1 Nutch PMC votes are cast.

  [ ] +1 Release this package as Apache Nutch 2.0
  [ ] -1 Do not release this package because...

 Many Thanks and here's to plenty more.

 Kind Regards,
 Lewis

 P.S. Here's my +1.

 --
 Lewis





-- 
Lewis


[jira] [Commented] (NUTCH-1418) error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/

2012-07-02 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405235#comment-13405235
 ] 

Markus Jelsma commented on NUTCH-1418:
--

There is indeed no problem crawling Wikipedia. Anyway, the warning is fine, and 
the undecoded path is being added to the rule set. Perhaps the path should be 
skipped: if it cannot be decoded, there is no need to store it in the rule set, 
is there?
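
The suggestion above (skip a path that cannot be decoded instead of storing it)
can be sketched as follows. This is not the RobotRulesParser code: it is a
stand-alone Python illustration, and decode_path and build_rule_set are
hypothetical names. Python's urllib.parse.unquote silently leaves a malformed
escape such as "%3M" in place, so the sketch adds its own strict check first.

```python
import re
from urllib.parse import unquote

# A "%" that is not followed by two hex digits is a malformed escape.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def decode_path(path):
    """Percent-decode a robots.txt path, raising ValueError on a malformed escape."""
    if _BAD_ESCAPE.search(path):
        raise ValueError("can't decode path: " + path)
    return unquote(path)


def build_rule_set(paths):
    """Return (rules, skipped): undecodable paths are skipped rather than stored."""
    rules, skipped = [], []
    for path in paths:
        try:
            rules.append(decode_path(path))
        except ValueError:
            skipped.append(path)
    return rules, skipped


if __name__ == "__main__":
    rules, skipped = build_rule_set([
        "/wiki/Special%3ASearch",
        "/wiki/Wikipedia%3Mediation_Committee/",
    ])
    print(rules)
    print(skipped)
```

Whether to skip such a path or keep it undecoded is a policy choice: skipping
keeps the rule set clean, while keeping the raw string would still match URLs
that contain the same literal text.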



 error parsing robots rules- can't decode path: 
 /wiki/Wikipedia%3Mediation_Committee/
 

 Key: NUTCH-1418
 URL: https://issues.apache.org/jira/browse/NUTCH-1418
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.4
Reporter: Arijit Mukherjee

 Since learning that Nutch is unable to crawl the JavaScript function 
 calls in href, I started looking for other alternatives. I decided to crawl 
 http://en.wikipedia.org/wiki/Districts_of_India.
 I first tried injecting this URL and following the step-by-step approach 
 up to the fetcher, when I realized that Nutch did not fetch anything from this 
 website. I looked into logs/hadoop.log and found the following 3 lines, 
 which I believe could be saying that Nutch is unable to parse the 
 robots.txt of the website and that the fetcher therefore stopped?

 2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
 2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
 2012-07-02 16:41:07,452 WARN  api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/
 I tried checking the URL using parsechecker and there were no issues there! I think 
 it means that the robots.txt is malformed for this website, which is 
 preventing the fetcher from fetching anything. Is there a way to get around this 
 problem, as parsechecker seems to go on its merry way parsing?





Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-02 Thread Julien Nioche
Will definitely have a look tomorrow
Thanks

On 2 July 2012 18:49, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

 Anyone else for this RC?

 I've been slightly distracted with a number of things recently and am only
 just getting round to following this one up, so apologies about that.

 Best

 Lewis





-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble


Re: [VOTE] Apache Nutch 2.0 Release Candidate #3

2012-07-02 Thread Mattmann, Chris A (388J)
I'll try to scope this by tomorrow...thanks Lewis.

Cheers,
Chris

On Jul 2, 2012, at 10:49 AM, Lewis John Mcgibbney wrote:

 Anyone else for this RC?
 
 I've been slightly distracted with a number of things recently and am only
 just getting round to following this one up, so apologies about that.
 
 Best
 
 Lewis
 


++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:   http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++



[jira] [Commented] (NUTCH-1418) error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/

2012-07-02 Thread Arijit Mukherjee (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405639#comment-13405639
 ] 

Arijit Mukherjee commented on NUTCH-1418:
-

Hi,
   I have seen that the links on the page 
http://en.wikipedia.org/wiki/Districts_of_India are not picked up as outlinks 
in the fetch/parse process. However, parsechecker is able to pick up 
all the links as outlinks. Looking through hadoop.log, I concluded that 
this warning is the only issue in fetch, and that fetch bails out afterwards. 
So I believe that fetch bails out on seeing this WARN.
   I have copy-pasted the contents of my hadoop.log, which contains the log 
from fetch (where the WARN occurs) as well as the log from parsechecker.

=hadoop.log=
2012-07-02 16:40:35,300 INFO  crawl.Injector - Injector: starting at 2012-07-02 16:40:35
2012-07-02 16:40:35,301 INFO  crawl.Injector - Injector: crawlDb: /root/arijit/crawler/crawl/crawldb
2012-07-02 16:40:35,301 INFO  crawl.Injector - Injector: urlDir: /root/arijit/crawler/urls
2012-07-02 16:40:35,301 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
2012-07-02 16:40:35,863 INFO  plugin.PluginRepository - Plugins: looking in: /root/arijit/apache-nutch-1.4-bin/runtime/local/plugins
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - Registered Plugins:
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - Basic Indexing Filter (index-basic)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - Html Parse Plug-in (parse-html)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - HTTP Framework (lib-http)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2012-07-02 16:40:35,993 INFO  plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2012-07-02 16:40:35,994 INFO  plugin.PluginRepository - JavaScript Parser (parse-js)
2012-07-02 16:40:35,994 INFO  plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2012-07-02 16:40:35,994 INFO  plugin.PluginRepository - Registered Extension-Points:
2012-07-02 16:40:35,994 INFO  plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2012-07-02 16:40:35,994 INFO  plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2012-07-02 16:40:35,994 INFO  plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-07-02 16:40:35,994 INFO  plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2012-07-02 16:40:35,994 INFO  plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2012-07-02 16:40:35,994 INFO  plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-07-02 16:40:35,994 INFO  plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2012-07-02 16:40:35,994 INFO  plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2012-07-02 16:40:36,070 INFO  regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2012-07-02 16:40:36,696 INFO  crawl.Injector - Injector: Merging injected urls into crawl db.
2012-07-02 16:40:36,999 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-07-02 16:40:37,880 INFO  crawl.Injector - Injector: finished at 2012-07-02 16:40:37, elapsed: 00:00:02
2012-07-02 16:40:41,619 INFO  crawl.Generator - Generator: starting at 2012-07-02 16:40:41
2012-07-02 16:40:41,619 INFO  crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2012-07-02 16:40:41,619 INFO  crawl.Generator - Generator: filtering: true
2012-07-02 16:40:41,620