Re: Nutch and Mahout integration
We wrote a custom Nutch parse plugin that uses a Mahout classifier to classify docs.

Mathijs Homminga

On Jul 1, 2012, at 21:02, Alexander Aristov alexander.aris...@gmail.com wrote:

> People, can you give me some advice? I want to integrate Nutch and Mahout to classify crawled pages.
>
> First question: has anyone tried this, and are there any libraries available?
>
> Next: which is better/easier? Improve Nutch and inject a Mahout classifier into the project, OR improve Mahout to add the ability to read and write Nutch files?
>
> Best Regards
> Alexander Aristov
Re: Nutch and Mahout integration
Alexander,

> Can you give me some advice? I want to integrate Nutch and Mahout to classify crawled pages.
> First question: has anyone tried this, and are there any libraries available?

https://github.com/DigitalPebble/behemoth could be used to do Nutch - Behemoth - Mahout. The only problem is that there is no standard format for the Mahout classifiers, so you would need to write a bit of code for it. There is also a SOLR plugin in Behemoth.

Alternatively, you can use our Text Classification API (https://github.com/DigitalPebble/TextClassification) within a Nutch indexing filter.

> Next: which is better/easier? Improve Nutch and inject a Mahout classifier into the project, OR improve Mahout to add the ability to read and write Nutch files?

It depends on what you need to do with the data after classification. Behemoth already does the conversion from Nutch to Mahout, but again the problem is the lack of a standard on the Mahout side.

HTH

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
[jira] [Updated] (NUTCH-1087) Deprecate crawl command and replace with example script
[ https://issues.apache.org/jira/browse/NUTCH-1087?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Julien Nioche updated NUTCH-1087:
--------------------------------
Attachment: crawl

WORK IN PROGRESS
Need to add more comments + include the injection, linkdb and SOLR steps.
The rest of the script should be fine and should provide a good basis.

Deprecate crawl command and replace with example script
-------------------------------------------------------

Key: NUTCH-1087
URL: https://issues.apache.org/jira/browse/NUTCH-1087
Project: Nutch
Issue Type: Task
Affects Versions: 1.4
Reporter: Markus Jelsma
Priority: Minor
Fix For: 1.6
Attachments: crawl

* remove the crawl command
* add basic crawl shell script

See thread: http://www.mail-archive.com/dev@nutch.apache.org/msg03848.html

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
Re: Add me to the Mailing list
http://nutch.apache.org/mailing_lists.html#Developers

On Sun, Jul 1, 2012 at 3:48 PM, michael F mich...@bionic8.com wrote:

--
Lewis
[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405038#comment-13405038 ]

Lewis John McGibbney commented on NUTCH-1415:
---------------------------------------------

Hi Sebastian. I will be pushing the 1.5.1 RC today. I'll test this and commit today if all is good. Thanks for this.
Lewis

release packages to contain top level folder apache-nutch-x.x
-------------------------------------------------------------

Key: NUTCH-1415
URL: https://issues.apache.org/jira/browse/NUTCH-1415
Project: Nutch
Issue Type: Bug
Affects Versions: nutchgora, 1.6, 1.5.1
Reporter: Sebastian Nagel
Priority: Minor
Attachments: NUTCH-1415.patch

The release packages should contain a top level folder named apache-nutch-x.x (x replaced by major and minor version) as in previous releases. Unpacking the packages from the command line via
  tar xvfz package.tar.gz
or
  unzip package.zip
should place all files in that folder.

Cf. discussions on mailing lists:
* http://mail-archives.apache.org/mod_mbox/nutch-dev/201205.mbox/%3c4fbd613f.1020...@googlemail.com%3E
* http://mail-archives.apache.org/mod_mbox/nutch-user/201206.mbox/%3czarafa.4fe9e41c.2e51.6a20afee54fe4...@mail.openindex.io%3E
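The requirement described in NUTCH-1415 (every file in a release package sits under a single apache-nutch-x.x folder) can be checked mechanically. The following is only a minimal sketch in Python, not part of the Nutch build: the function names, the in-memory test archive and the file paths inside it are all hypothetical, chosen for illustration.

```python
import io
import tarfile
from pathlib import PurePosixPath


def top_level_entries(tar_gz_bytes):
    """Return the set of first path components of every member in a .tar.gz."""
    roots = set()
    with tarfile.open(fileobj=io.BytesIO(tar_gz_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            parts = PurePosixPath(member.name).parts
            if parts:  # skip a bare "." entry
                roots.add(parts[0])
    return roots


def has_single_top_level_folder(tar_gz_bytes, expected_folder):
    """True if every member of the archive lives under expected_folder."""
    return top_level_entries(tar_gz_bytes) == {expected_folder}


def build_archive(paths):
    """Build an in-memory .tar.gz with placeholder files (illustration only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in paths:
            data = b"placeholder"
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


# Hypothetical package layouts: one with the top level folder, one without.
good = build_archive(["apache-nutch-1.5.1/bin/nutch", "apache-nutch-1.5.1/conf/nutch-site.xml"])
bad = build_archive(["bin/nutch", "conf/nutch-site.xml"])
print(has_single_top_level_folder(good, "apache-nutch-1.5.1"))
print(has_single_top_level_folder(bad, "apache-nutch-1.5.1"))
```

A check of this kind could be run against each generated artifact before a release candidate is tagged; the same idea applies to zip packages via the standard zipfile module.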
Re: Nutch Author, Publication, and Religion Detection
OK, so please let us know how you get on. Although you seem to have a clear idea about how you're going to progress with the issue, I would seriously consider taking on board Julien's comments and grabbing the code that he's made available for similar tasks.

All the best
Lewis

On Fri, Jun 29, 2012 at 7:19 PM, JAB george.garn...@baesystems.com wrote:

> Hi Lewis;
>
> I'm looking at creating a Nutch plugin to determine if a document is an article on religion, and what religion it's primarily talking about. Then, adding an annotation called 'religion' to the document with the primary category of the religion. Examples: Atheism, Buddhism, Christian, Hindu, Jewish, Muslim, or Unknown (if it can't be determined). No annotation will be added if it's not an article on religion.
>
> Next, another annotation for the sub-category of the religion. For example, under Christian would be Catholic or Protestant. Then possibly a third annotation for the denomination. Examples of denomination: 'Baptist Bible Churches' or 'Christian Methodist Episcopal Church' (I have a list of 147 denominations). I'm not familiar with religious breakdowns, so I don't know if this is the appropriate way to categorize them.
>
> Design:
>
> I created a Java class on religion that extends the IndexingFilter class. I next determine if it's an article on religion. I do so by counting the number of occurrences of certain key words in the document. For example, if 'God' appears more than 10 times, it's an article on religion. If it mentions 'Christian' more than a certain number of times and more often than other religions, the sub-category would be 'Christian'. The first match in the denomination search would be assumed to be the denomination.
>
> I'm also using a language-detection plugin (http://developer.cybozu.co.jp/oss/2010/10/language-detect.html) to determine the language of the document so I can search for words in the document's native language.
>
> I don't know if this is the best approach to solving this issue.
>
> --
> View this message in context: http://lucene.472066.n3.nabble.com/Nutch-Author-Publication-and-Religion-Detection-tp3991662p3992130.html
> Sent from the Nutch - Dev mailing list archive at Nabble.com.

--
Lewis
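The keyword-counting design described in the quoted message can be sketched in a few lines. This is a Python illustration of the heuristic only, not the poster's Java IndexingFilter and not any Nutch API. The keyword lists and the category threshold of 3 are assumptions made up for the example; only the 'God' threshold of 10 and the category names come from the message.

```python
import re
from collections import Counter

# Hypothetical keyword lists per category (assumed for illustration).
RELIGION_KEYWORDS = {
    "Atheism": ["atheism", "atheist"],
    "Buddhism": ["buddhism", "buddhist"],
    "Christian": ["christian", "christianity"],
    "Hindu": ["hindu", "hinduism"],
    "Jewish": ["jewish", "judaism"],
    "Muslim": ["muslim", "islam"],
}
GENERAL_THRESHOLD = 10   # from the message: 'God' more than 10 times
CATEGORY_THRESHOLD = 3   # assumed; the message only says "a certain number"


def classify_religion(text):
    """Return None for non-religious text, else a category name or 'Unknown'."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    if words["god"] <= GENERAL_THRESHOLD:
        return None  # not an article on religion: no annotation
    scores = {
        name: sum(words[term] for term in terms)
        for name, terms in RELIGION_KEYWORDS.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    # The winner must clear the threshold and beat every other category.
    if best_score > CATEGORY_THRESHOLD and best_score > runner_up_score:
        return best
    return "Unknown"


print(classify_religion("god " * 11 + "christian " * 4 + "hindu "))
```

Raw counts favour long documents, so a real plugin would probably normalise by document length; that refinement is outside what the message describes.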
[jira] [Commented] (NUTCH-1415) release packages to contain top level folder apache-nutch-x.x
[ https://issues.apache.org/jira/browse/NUTCH-1415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405163#comment-13405163 ]

Lewis John McGibbney commented on NUTCH-1415:
---------------------------------------------

Hi Sebastian. Having committed this to the 1.5.1 branch and the subsequent RC#2 tag, the tar-src and zip-src artifacts seem to be fine; however, tar-bin and zip-bin are not, and still fail to produce the apache-nutch-x.x top level folder within the generated artifacts. I wonder if you could check this out for me at your earliest convenience, as we are very close to generating a good 1.5.1 RC#2 once this is complete.
Thanks in advance
Lewis
Re: [VOTE] Apache Nutch 2.0 Release Candidate #3
Anyone else for this RC? I've been slightly distracted with a number of things recently and am only just getting round to following this one up, so apologies about that.

Best
Lewis

On Wed, Jun 27, 2012 at 10:23 AM, Ferdy Galema ferdy.gal...@kalooga.com wrote:

> +1
>
> Crawling with HBaseStore works from injecting to indexing. Great work Lewis.
>
> On Mon, Jun 25, 2012 at 6:32 PM, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:
>
>> Hi Everyone,
>>
>> A candidate for the Apache Nutch 2.0 RC3 is available at:
>>
>> http://people.apache.org/~lewismc/apache-nutch-2.0rc3
>>
>> The release candidate is a src.zip and src.tar.gz ONLY archive of the sources in:
>>
>> http://svn.apache.org/repos/asf/nutch/tags/release-2.0rc3
>>
>> We release Nutch 2.0 in this fashion due to the inclusion of Apache Gora and the likelihood that users will regularly recompile the code to suit dynamic requirements.
>>
>> Further, a staged Maven repository of the 2.0 jar, sources.jar and javadoc.jar is available here:
>>
>> https://repository.apache.org/content/repositories/orgapachenutch-275
>>
>> Please vote on releasing this package as Apache Nutch 2.0. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Nutch 2.0
>> [ ] -1 Do not release this package because...
>>
>> Many thanks, and here's to plenty more.
>>
>> Kind Regards,
>> Lewis
>>
>> P.S. Here's my +1.
>>
>> --
>> Lewis

--
Lewis
[jira] [Commented] (NUTCH-1418) error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
[ https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405235#comment-13405235 ]

Markus Jelsma commented on NUTCH-1418:
--------------------------------------

There is indeed no problem crawling Wikipedia. Anyway, the warning is fine and the undecoded path is being added to the rule set. Perhaps the path should be skipped: if it cannot be decoded, there's no need to store it in the rule set, is there?

error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
-------------------------------------------------------------------------------------

Key: NUTCH-1418
URL: https://issues.apache.org/jira/browse/NUTCH-1418
Project: Nutch
Issue Type: Bug
Affects Versions: 1.4
Reporter: Arijit Mukherjee

Since learning that Nutch will be unable to crawl the JavaScript function calls in href, I started looking for other alternatives. I decided to crawl http://en.wikipedia.org/wiki/Districts_of_India.

I first tried injecting this URL and following the step-by-step approach up to the fetcher, when I realized that Nutch did not fetch anything from this website. I looked into logs/hadoop.log and found the following 3 lines, which I believe could be saying that Nutch is unable to parse the robots.txt of the website and that the fetcher therefore stopped?

2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia_talk%3Mediation_Committee/
2012-07-02 16:41:07,452 WARN api.RobotRulesParser - error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Cabal/Cases/

I tried checking the URL using parsechecker and there were no issues there! I think it means that the robots.txt is malformed for this website, which is preventing the fetcher from fetching anything. Is there a way to get around this problem, as parsechecker seems to go on its merry way parsing?
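The paths in the warnings fail to decode because "%3M" is not a valid percent-escape: "%" must be followed by two hexadecimal digits, and "M" is not one. The sketch below illustrates, in Python, the behaviour suggested in the comment above (skip a path that cannot be decoded rather than store it undecoded). It is not the code of Nutch's RobotRulesParser; the function names and the second example path are hypothetical.

```python
import re
from urllib.parse import unquote

# A "%" that is not followed by exactly two hex digits is a malformed escape.
_MALFORMED_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def decode_path(path):
    """Percent-decode a robots.txt path, or return None if it is malformed."""
    if _MALFORMED_ESCAPE.search(path):
        return None
    return unquote(path)


def build_rule_paths(raw_paths):
    """Decode each path, skipping (not storing) any that cannot be decoded."""
    decoded = []
    for raw in raw_paths:
        path = decode_path(raw)
        if path is None:
            continue  # a real parser would log a warning here
        decoded.append(path)
    return decoded


rules = build_rule_paths([
    "/wiki/Wikipedia%3Mediation_Committee/",  # from the log: malformed "%3M"
    "/wiki/Special%3ASearch",                 # hypothetical well-formed path
])
print(rules)
```

Python's own unquote is lenient and leaves a malformed escape in place, which is why the sketch validates the escapes explicitly before decoding.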
Re: [VOTE] Apache Nutch 2.0 Release Candidate #3
Will definitely have a look tomorrow.

Thanks

On 2 July 2012 18:49, Lewis John Mcgibbney lewis.mcgibb...@gmail.com wrote:

> Anyone else for this RC? I've been slightly distracted with a number of things recently and am only just getting round to following this one up, so apologies about that.
>
> Best
> Lewis

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: [VOTE] Apache Nutch 2.0 Release Candidate #3
I'll try to scope this by tomorrow... thanks Lewis.

Cheers,
Chris

On Jul 2, 2012, at 10:49 AM, Lewis John Mcgibbney wrote:

> Anyone else for this RC? I've been slightly distracted with a number of things recently and am only just getting round to following this one up, so apologies about that.
>
> Best
> Lewis

++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++
[jira] [Commented] (NUTCH-1418) error parsing robots rules- can't decode path: /wiki/Wikipedia%3Mediation_Committee/
[ https://issues.apache.org/jira/browse/NUTCH-1418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13405639#comment-13405639 ]

Arijit Mukherjee commented on NUTCH-1418:
-----------------------------------------

Hi,

I have seen that the urls mentioned in the url http://en.wikipedia.org/wiki/Districts_of_India are not picked up in the fetch/parse process into outlinks. However, the parsechecker is able to pick all the links into outlink. On looking through the hadoop.log, I concluded that this is the only issue in fetch - and thereafter fetch bails out. So, I believe that fetch bails out on seeing this WARN.

I have copy-pasted the contents of my hadoop.log - which contains the log from fetch (where the WARN occurs) as well as the log from parsechecker.

=hadoop.log=
2012-07-02 16:40:35,300 INFO crawl.Injector - Injector: starting at 2012-07-02 16:40:35
2012-07-02 16:40:35,301 INFO crawl.Injector - Injector: crawlDb: /root/arijit/crawler/crawl/crawldb
2012-07-02 16:40:35,301 INFO crawl.Injector - Injector: urlDir: /root/arijit/crawler/urls
2012-07-02 16:40:35,301 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2012-07-02 16:40:35,863 INFO plugin.PluginRepository - Plugins: looking in: /root/arijit/apache-nutch-1.4-bin/runtime/local/plugins
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Registered Plugins:
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2012-07-02 16:40:35,993 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Registered Extension-Points:
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2012-07-02 16:40:35,994 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2012-07-02 16:40:36,070 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2012-07-02 16:40:36,696 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2012-07-02 16:40:36,999 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-07-02 16:40:37,880 INFO crawl.Injector - Injector: finished at 2012-07-02 16:40:37, elapsed: 00:00:02
2012-07-02 16:40:41,619 INFO crawl.Generator - Generator: starting at 2012-07-02 16:40:41
2012-07-02 16:40:41,619 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2012-07-02 16:40:41,619 INFO crawl.Generator - Generator: filtering: true
2012-07-02 16:40:41,620