What I was missing when I first started with Nutch (and one can claim that a
little research would have solved it) was how to configure nutch-site.xml.
Looking at the NutchTutorial, you can't be sure what applies to Nutch 2.x
and what doesn't unless you already know that nutch-site.xml is the same
for both.

Specifically, I was missing the fact that I should set up http.agent.name,
http.robots.agents, plugin.folders and plugin.includes, that setting
parser.timeout and solr.commit.size helps a lot when debugging, and that I
should make the logs more verbose.
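
For illustration, here is a minimal sketch of a nutch-site.xml that sets the
properties listed above. The property names come from this message; every
value is an assumption chosen for the example (the agent name is a
placeholder, and the plugin list should be checked against the
nutch-default.xml shipped with your release).

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Name the crawler reports to web servers. Placeholder value. -->
  <property>
    <name>http.agent.name</name>
    <value>MyNutchSpider</value>
  </property>

  <!-- Agent strings matched against robots.txt, highest precedence first.
       Assumed convention: the http.agent.name value, then the wildcard. -->
  <property>
    <name>http.robots.agents</name>
    <value>MyNutchSpider,*</value>
  </property>

  <!-- Where the plugins live. Assumption: a relative "plugins" directory;
       an absolute path may be needed when running from an IDE. -->
  <property>
    <name>plugin.folders</name>
    <value>plugins</value>
  </property>

  <!-- Regex of plugins to load. Illustrative list only; adjust it to the
       plugins your release actually ships. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
  </property>

  <!-- Parse timeout in seconds. Assumed debugging value: -1 turns the
       timeout off so a breakpoint does not abort the parse. -->
  <property>
    <name>parser.timeout</name>
    <value>-1</value>
  </property>

  <!-- Number of documents sent to Solr per batch. Assumed debugging value:
       a small batch so documents show up in Solr sooner. -->
  <property>
    <name>solr.commit.size</name>
    <value>1</value>
  </property>
</configuration>
```

Log verbosity is not part of this file; as far as I can tell it is
controlled separately through conf/log4j.properties.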

It seems obvious in retrospect, but when you're making your first steps
you feel a little lost.

Perhaps a simple note in the Nutch2Tutorial such as "see NutchTutorial for
nutch-site.xml configuration" is enough.



On Thu, Jan 23, 2014 at 8:24 PM, Tejas Patil <[email protected]> wrote:

> On Thu, Jan 23, 2014 at 1:36 PM, d_k <[email protected]> wrote:
>
>> My main concern with the Nutch2Tutorial was that it didn't stand by
>> itself. As a newcomer to Nutch I treated the NutchTutorial (for 1.x) with
>> suspicion because I didn't know what is relevant for Nutch 2 and what isn't.
>> And the Nutch2Tutorial alone is not enough to get you going.
>>
>> I think this can be addressed by creating a single page or perhaps
>> several pages that together cover everything you need to perform a basic
>> crawl:
>>
>> [*] Configuring the data store
>> [**] HBase
>> [**] Cassandra
>>
>    [*] General Nutch 2 client configuration that is relevant to any store
>
> [1] : http://wiki.apache.org/nutch/Nutch2Tutorial
> [2] : http://wiki.apache.org/nutch/Nutch2Cassandra
>
>
>> [**] MySQL
>>
>
> MySQL is no longer supported in Gora or in new Nutch versions, so there is
> no wiki page for it.
>
>>
>> [*] Crawling
>> [**] Crawling step by step (running each step separately)
>> [**] Performing a full crawl
>> [***] using the crawl script
>> [***] using the job file
>>
>
> The commands are the same as in 1.x. Only the arguments change, and those
> can be traced by looking at each command's usage.
>
> The notion of having everything in one place would make things neat.
> AFAIK, the reason why this was not done before was maintenance overhead. If
> you want to create such a page, feel free to add the same. You would need
> to create a login to nutch wiki. If there are issues with that, then just
> share the document in text format and I would add it to nutch wiki.
>
> ~tejas
>
>>
>>
>>
>>
>> On Wed, Jan 22, 2014 at 1:53 PM, Julien Nioche <
>> [email protected]> wrote:
>>
>>> Thanks Tejas!
>>>
>>>
>>> On 22 January 2014 11:51, Tejas Patil <[email protected]> wrote:
>>>
>>>> Moved the old nutchhadooptutorial page from Nutch wiki "Front page" to
>>>> "Archive and Legacy".
>>>>
>>>> ~tejas
>>>>
>>>>
>>>> On Wed, Jan 22, 2014 at 5:09 PM, Tejas Patil <[email protected]> wrote:
>>>>
>>>>> Thanks *Julien* for pointing me to new "NutchHadoopSingleNodeTutorial"
>>>>> wiki page [0]. I would soon remove the old nutchhadooptutorial page
>>>>> from wiki.
>>>>>
>>>>> [0] : http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial
>>>>>
>>>>> *@d_k*, there are already tutorials for running Nutch 2.x. See [1]
>>>>> and [2]. Those are not as extensive as the tutorial for 1.x [3] but cover
>>>>> the steps which are different for 2.x. The remaining steps after datastore
>>>>> setup are similar, the only difference being the command params, which can
>>>>> be figured out from the usage. They were not duplicated in the 2.x
>>>>> tutorials to avoid maintenance overhead. Do you think that the 2.x
>>>>> tutorials are inadequate in some regards?
>>>>>
>>>>> [1] : http://wiki.apache.org/nutch/Nutch2Tutorial
>>>>> [2] : http://wiki.apache.org/nutch/Nutch2Cassandra
>>>>> [3] : http://wiki.apache.org/nutch/NutchTutorial
>>>>>
>>>>> Thanks,
>>>>> Tejas
>>>>>
>>>>>
>>>>> On Wed, Jan 22, 2014 at 2:47 AM, d_k <[email protected]> wrote:
>>>>>
>>>>>> Actually what I would like to see is a Nutch 2.x tutorial at the same
>>>>>> level of detail as the
>>>>>> http://wiki.apache.org/nutch/NutchHadoopTutorial
>>>>>> What is the process of contributing to that wiki page?
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 21, 2014 at 9:33 PM, Julien Nioche <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Hi
>>>>>>>
>>>>>>> The whole thing has been replaced with
>>>>>>> http://wiki.apache.org/nutch/NutchHadoopSingleNodeTutorial which
>>>>>>> does exactly what you described. +1 to remove the old
>>>>>>> nutchhadooptutorial page
>>>>>>>
>>>>>>> J.
>>>>>>>
>>>>>>>
>>>>>>> On 21 January 2014 17:44, Tejas Patil <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi nutch-dev,
>>>>>>>>
>>>>>>>> I was looking at [0] and realized that with the massive number of
>>>>>>>> Hadoop setup tutorials out there on the internet, we need not repeat the
>>>>>>>> same on the nutch wiki page and can instead assume that the user has
>>>>>>>> already done the Hadoop setup. For convenience, we could direct users to
>>>>>>>> the Hadoop wiki page which has Hadoop setup details.
>>>>>>>> Plus, I propose the following:
>>>>>>>>
>>>>>>>> - Section "Downloading Hadoop and Nutch" : Remove the Hadoop
>>>>>>>> portions and let the Nutch stuff stay.
>>>>>>>> - Section "Setting Up The Deployment Architecture" must be removed.
>>>>>>>> - Section "Deploy Nutch to Single Machine" and "Deploy Nutch to
>>>>>>>> Multiple Machines" can be merged together.
>>>>>>>> - Section "Performing a Nutch Crawl", "Testing the Crawl" and
>>>>>>>> "Performing a Search" must be merged, its contents must be updated.
>>>>>>>> - Section "Rsyncing Code to Slaves" and "Updates" can be completely
>>>>>>>> removed.
>>>>>>>>
>>>>>>>> Any comments ?
>>>>>>>>
>>>>>>>> [0] : http://wiki.apache.org/nutch/NutchHadoopTutorial
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Tejas
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Open Source Solutions for Text Engineering
>>>>>>>
>>>>>>> http://digitalpebble.blogspot.com/
>>>>>>> http://www.digitalpebble.com
>>>>>>> http://twitter.com/digitalpebble
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
>>> http://twitter.com/digitalpebble
>>>
>>
>>
>
