What I was missing when I first started with Nutch (and one can argue that a little research would have solved it) was how to configure nutch-site.xml. Looking at the NutchTutorial, you can't be sure what applies to Nutch 2.x and what doesn't unless you already know that nutch-site.xml is the same in both.
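To make this concrete, here is a rough sketch of a nutch-site.xml that overrides the properties listed just below. The property names are the ones from this message; every value is only an illustrative placeholder (the agent name, the plugin list, the timeout and the commit size are all assumptions), so compare them with the nutch-default.xml of your version before relying on them.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Hypothetical crawler name; Nutch refuses to fetch when this is empty. -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
  <!-- Agent strings checked against robots.txt; placeholder value. -->
  <property>
    <name>http.robots.agents</name>
    <value>MyNutchCrawler,*</value>
  </property>
  <!-- Assumed relative path; may need to be absolute when running from an IDE. -->
  <property>
    <name>plugin.folders</name>
    <value>plugins</value>
  </property>
  <!-- Illustrative plugin list only; take the real one from nutch-default.xml. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>
  <!-- Assumption: a larger parse timeout (in seconds) leaves room for stepping through the parser in a debugger. -->
  <property>
    <name>parser.timeout</name>
    <value>300</value>
  </property>
  <!-- Assumption: a small batch size makes documents reach Solr sooner while testing. -->
  <property>
    <name>solr.commit.size</name>
    <value>10</value>
  </property>
</configuration>
```

Anything not overridden here keeps the value from nutch-default.xml.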
Specifically, I didn't know that I should set up http.agent.name, http.robots.agents, plugin.folders and plugin.includes, that setting parser.timeout and solr.commit.size helps a lot when debugging, and that I should make the logging more verbose. It seems obvious in retrospect, but when you're making your first steps you feel a little lost.

Perhaps a simple "view NutchTutorial for nutch-site.xml configuration" is enough.

On Thu, Jan 23, 2014 at 8:24 PM, Tejas Patil <[email protected]> wrote:

> On Thu, Jan 23, 2014 at 1:36 PM, d_k <[email protected]> wrote:
>
>> My main concerns with the Nutch2Tutorial was that it didn't stand by
>> itself. As a newcomer to nutch I treated the NutchTutorial (for 1.x) with
>> suspicion because I didn't know what is relevant for Nutch 2 and what isn't.
>> And the Nutch2Tutorial tutorial alone is not enough to get you going.
>>
>> I think this can be addressed by creating a single page or perhaps
>> several pages that together cover everything you need to perform a basic
>> crawl:
>>
>> [*] Configuring the data store
>> [**] HBase
>> [**] Cassandra
>>
> [*] General nutch 2 client configuration that are relevant to any store
>
> [1] : http://wiki.apache.org/nutch/Nutch2Tutorial
> [2] : http://wiki.apache.org/nutch/Nutch2Cassandra
>
>> [**] MySQL
>>
>
> Is now not supported in Gora and new Nutch versions so no wiki page for
> it.
>
>>
>> [*] Crawling
>> [**] Crawling step by step (running each step separately)
>> [**] Performing a full crawl
>> [***] using the crawl script
>> [***] using the job file
>>
>
> The commands are same as 1.X. The only change needed would be for
> arguments which can be traced looking at the command usage.
>
> The notion of having everything in one place would make things neat.
> AFAIK, the reason why this was not done before was maintenance overhead. If
> you want to create such a page, feel free to add the same. You would need
> to create a login to nutch wiki.
> If there are issues with that, then just share the document in text
> format and I would add it to nutch wiki.
>
> ~tejas
>
>> On Wed, Jan 22, 2014 at 1:53 PM, Julien Nioche <[email protected]> wrote:
>>
>>> Thanks Tejas!
>>>
>>> On 22 January 2014 11:51, Tejas Patil <[email protected]> wrote:
>>>
>>>> Moved the old nutchhadooptutorial page from Nutch wiki "Front page" to
>>>> "Archive and Legacy".
>>>>
>>>> ~tejas
>>>>
>>>> On Wed, Jan 22, 2014 at 5:09 PM, Tejas Patil <[email protected]> wrote:
>>>>
>>>>> Thanks *Julien* for pointing me to new "NutchHadoopSingleNodeTutorial"
>>>>> wiki page [0]. I would soon remove the old nutchhadooptutorial page
>>>>> from wiki.
>>>>>
>>>>> [0] : http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial
>>>>>
>>>>> *@d_k*, there are already tutorials for running Nutch 2.x. See [1]
>>>>> and [2]. Those are not as extensive as the tutorial for 1.x [3] but carry
>>>>> the steps which are different for 2.x. The rest of the steps after
>>>>> datastore setup are similar - the only difference being in the command
>>>>> params which can be figured out from the usage and so they were not
>>>>> duplicated in those 2.x tutorials to avoid maintenance overhead. Do you
>>>>> think that the 2.x tutorials are inadequate in some regards?
>>>>>
>>>>> [1] : http://wiki.apache.org/nutch/Nutch2Tutorial
>>>>> [2] : http://wiki.apache.org/nutch/Nutch2Cassandra
>>>>> [3] : http://wiki.apache.org/nutch/NutchTutorial
>>>>>
>>>>> Thanks,
>>>>> Tejas
>>>>>
>>>>> On Wed, Jan 22, 2014 at 2:47 AM, d_k <[email protected]> wrote:
>>>>>
>>>>>> Actually what I would like to see is a Nutch 2.x tutorial at the same
>>>>>> level of detail as the
>>>>>> http://wiki.apache.org/nutch/NutchHadoopTutorial
>>>>>> What is the process of contributing to that wiki page?
>>>>>>
>>>>>> On Tue, Jan 21, 2014 at 9:33 PM, Julien Nioche <[email protected]> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> The whole thing has been replaced with
>>>>>>> http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial which
>>>>>>> does exactly what you described. +1 to remove the old
>>>>>>> nutchhadooptutorial page
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>> On 21 January 2014 17:44, Tejas Patil <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi nutch-dev,
>>>>>>>>
>>>>>>>> I was looking at [0] and realized that with the massive number of
>>>>>>>> Hadoop setup tutorials out there on internet, we need not repeat the
>>>>>>>> same on nutch wiki page and instead assume that user has already done
>>>>>>>> Hadoop setup. For convenience, we could direct users to the Hadoop
>>>>>>>> wiki page which has Hadoop setup details.
>>>>>>>> Plus, I propose following:
>>>>>>>>
>>>>>>>> - Section "Downloading Hadoop and Nutch" : Remove the Hadoop
>>>>>>>> portions and let the Nutch stuff stay.
>>>>>>>> - Section "Setting Up The Deployment Architecture" must be removed.
>>>>>>>> - Section "Deploy Nutch to Single Machine" and "Deploy Nutch to
>>>>>>>> Multiple Machines" can be merged together.
>>>>>>>> - Section "Performing a Nutch Crawl", "Testing the Crawl" and
>>>>>>>> "Performing a Search" must be merged, its contents must be updated.
>>>>>>>> - Section "Rsyncing Code to Slaves" and "Updates" can be completely
>>>>>>>> removed.
>>>>>>>>
>>>>>>>> Any comments?
>>>>>>>>
>>>>>>>> [0] : http://wiki.apache.org/nutch/NutchHadoopTutorial
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Tejas
>>>>>>>
>>>>>>> --
>>>>>>> Open Source Solutions for Text Engineering
>>>>>>>
>>>>>>> http://digitalpebble.blogspot.com/
>>>>>>> http://www.digitalpebble.com
>>>>>>> http://twitter.com/digitalpebble

