Hi Gabriele

>> you don't need to have 2 and 3. The hadoop commands will work on the
>> local fs in a completely transparent way; it all depends on the way hadoop
>> is configured. It isolates the way data are stored (local or distributed) from
>> the client code, i.e. Nutch. By adding a separate script using fs, you'd add
>> more confusion and lead beginners to think that they HAVE to use fs.
>>
>
> I apologize for not having yet looked into hadoop in detail but I had
> understood that it would abstract over the single machine fs.
>

No problem. It would be worth spending a bit of time reading about Hadoop
if you want to get a better understanding of Nutch. Tom White's book is an
excellent reference, but the wikis and tutorials would be a good start.



> However, to get up and running after downloading nutch will the script just
> work or will I have to configure hadoop? I assume the latter.
>

Nope, no configuration needed. By default Hadoop uses the local FS. Nutch
relies on the Hadoop API for getting its inputs, so when you ran it, what
actually happened is that the data was read from the local FS via Hadoop.
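For reference, this default can be made explicit (or overridden) in Hadoop's conf/core-site.xml. The snippet below is only a sketch and assumes the Hadoop 0.x-era property name fs.default.name (later renamed fs.defaultFS); left unset, it falls back to the local filesystem anyway:

```xml
<!-- conf/core-site.xml: with fs.default.name at its default of file:///,
     Nutch reads and writes plain local files through the Hadoop API. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
</configuration>
```

Pointing the same property at an hdfs:// URI is what switches Nutch over to a distributed filesystem, with no change to the client code.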


> From a beginner perspective I like to reduce the magic (at first) and see
> through the commands, and get up and running asap.
> Hence 2. I'll be using 3.
>

Hadoop already reduces the magic for you :-)


>
>
>>
>> As for the legacy Lucene vs SOLR question, what about having a parameter
>> to determine which one should be used, and have a single script?
>>
>>
> Excellent idea. The default is solr for 1 and 3, but if one passes the
> parameter 'll' it will use the legacy Lucene. The default for 2 is ll since
> we want to get up and running fast (before knowing what solr is and setting
> it up).
>

It would be nice to have a third possible value (i.e. none) for the
-indexer parameter (besides solr and lucene). A lot of people use Nutch as a
crawling platform but do not do any indexing.
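A minimal sketch of how such a parameter could be handled in the script; the function name and the values 'solr', 'll' and 'none' are assumptions for illustration, not the actual script's interface:

```shell
# Hypothetical handler for an -indexer parameter: defaults to solr,
# accepts 'll' for legacy Lucene and 'none' to skip indexing entirely.
choose_indexer() {
  case "${1:-solr}" in
    solr) echo "solr" ;;    # index with SOLR (default)
    ll)   echo "lucene" ;;  # legacy Lucene indexing
    none) echo "none" ;;    # crawl only, no indexing step
    *)    echo "unknown indexer: $1" >&2; return 1 ;;
  esac
}
```

The crawl script would then branch on the returned value to run the SOLR indexing step, the legacy Lucene index/merge steps, or nothing at all.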


>> Why do you want to get the info about ALL the urls? There is a readdb
>> -stats command which gives a summary of the content of the crawldb. If you
>> need to check a particular URL or domain, just use readdb -url and readdb
>> -regex (or whatever the name of the param is)
>>
>
> At least when debugging/troubleshooting I found it useful to see which urls
> were fetched and the responses (robot_blocked, etc.).
> I can do that by examining each $it_crawlddb in turn, since I don't know when
> that url was fetched (although since the fetching is pretty linear I could
> also find it out, something like index in seeds/urls / $it_size).
>

Better to do that by looking at the content of the segments, using 'nutch
readseg -dump' or 'hadoop fs -libjars nutch.job
segment/SEGMENTNUM/crawl_data' for instance. That's probably not something
that most people will want to do, so maybe comment it out in your script?
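For instance, something along these lines (the crawl paths and segment name are placeholders, and this assumes a Nutch 1.x layout):

```shell
# Summary of the whole crawldb: counts of fetched/unfetched/gone URLs, etc.
bin/nutch readdb crawl/crawldb -stats

# Status of one particular URL (fetch time, retries, protocol status):
bin/nutch readdb crawl/crawldb -url http://www.example.com/

# Dump a single segment to a text file for inspection:
bin/nutch readseg -dump crawl/segments/20100601120000 /tmp/segdump
```

The readdb variants answer the "was this URL fetched and why not" question without dumping everything, while readseg -dump shows the per-segment fetch results.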

Running Hadoop in pseudo-distributed mode and looking at the Hadoop web GUIs
(http://localhost:50030) also gives you a lot of information about your crawl.

It would definitely be better to have a single crawldb in your script.

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
