>> I think it is a good idea to have a script like this, however your proposal
>> could be improved. It currently works only on a single machine and uses
>> commands such as mv, ls etc. which won't work on a pseudo or fully
>> distributed cluster. You should use the 'hadoop fs' commands instead.
>
> Okay, let's go for 3 editions:
> 1. one that's abridged and works only with Solr (tersest script)
> 2. unabridged with local fs - for beginners
> 3. hadoop unabridged
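To illustrate the point about 'hadoop fs': the local-fs commands map directly onto `hadoop fs` equivalents. A minimal sketch (the paths are hypothetical, and the `run` dry-run wrapper is mine, not part of any proposed script):

```shell
#!/bin/sh
# Sketch: local-fs commands replaced by their 'hadoop fs' equivalents.
# With DRY_RUN set, the script only prints the commands; unset it to run
# them against whatever file system Hadoop is configured with (local or
# distributed) - the client code does not change either way.
DRY_RUN=1

run() {
  if [ -n "$DRY_RUN" ]; then
    echo "$@"    # show what would be executed
  else
    "$@"         # execute for real
  fi
}

# ls crawl/segments          ->  hadoop fs -ls crawl/segments
run hadoop fs -ls crawl/segments
# mv crawl/crawldb <backup>  ->  hadoop fs -mv crawl/crawldb <backup>
run hadoop fs -mv crawl/crawldb crawl/crawldb.bak
```

This is exactly the transparency argument below: the same script works on a laptop or a cluster, depending only on the Hadoop configuration.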
You don't need to have 2 *and* 3. The hadoop commands will work on the
local fs in a completely transparent way; it all depends on the way hadoop
is configured. It isolates the way data are stored (local or distributed)
from the client code, i.e. Nutch. By adding a separate script using fs,
you'd add more confusion and lead beginners to think that they HAVE to use
fs. As for legacy-Lucene vs. Solr, what about having a parameter to
determine which one should be used and have a single script?

>> If I understand the script correctly, you then merge different crawldbs.
>> Why do you do that? There should be one crawldb per crawl so I don't
>> think this is at all necessary.
>
> So that I get a single dump with info about all the urls crawled. On the
> scale of the web this is probably a bad idea, isn't it?

It would be a bad idea even on a medium scale. That sort of works on a
single machine, but as soon as you'd get a bit of data you'd fill the
space on the disks and the whole thing would take ages. However, the point
still stands that there should be only one crawldb per crawl, and it
contains all the urls you've injected / discovered.

> But then how else could you inspect all the crawled urls at once?

Why do you want to get the info about ALL the urls? There is a readdb
-stats command which gives a summary of the content of the crawldb. If you
need to check a particular URL or domain, just use readdb -url and readdb
-regex (or whatever the name of the param is).

>> Having a script would definitely be a plus for beginners and would give
>> more flexibility than the crawl command.
>
> I ask as the first of beginners. Crawl is not recommended for whole-web
> crawling, I guess because it doesn't work incrementally. Why not add such
> an option to crawl? Shall I file a feature request / patch for that?

IMHO I'd rather have a good and reliable script to replace the Crawl
command.
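The readdb invocations mentioned above would look roughly like this. Treat it as a sketch: the crawldb path is illustrative, and (as noted in the thread) the exact spelling of the regex flag varies across Nutch versions. `NUTCH=echo` makes it a dry run; point NUTCH at bin/nutch to execute for real:

```shell
#!/bin/sh
# Sketch: inspecting the crawldb without dumping every URL.
# NUTCH=echo just prints the commands instead of running them.
NUTCH="${NUTCH:-echo}"

# Summary of the whole crawldb: url counts by status, score stats, etc.
$NUTCH readdb crawl/crawldb -stats

# Details for one specific URL.
$NUTCH readdb crawl/crawldb -url http://example.com/

# Dump only the entries matching a pattern (flag name is a guess here,
# check your Nutch version's readdb usage message).
$NUTCH readdb crawl/crawldb -dump dump-dir -regex '.*example.*'
```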
It does not help people understand the underlying steps; having the script
would make it easier to recover when there is a failure; and there are
other issues with it, e.g. the runaway parsing threads which are kept in
the VM. I am sure the all-in-one Crawl command helped many a user, but the
script would do just as well.

Thanks for your contribution BTW

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
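For reference, the underlying steps such a replacement script would expose are the standard inject / generate / fetch / parse / updatedb cycle. A dry-run sketch (segment naming, depth, and paths are illustrative; the real generate step produces a timestamped segment directory):

```shell
#!/bin/sh
# Sketch of the cycle the all-in-one Crawl command performs, as separate
# steps so a failed round can be rerun individually.
# NUTCH=echo prints the commands; set NUTCH=bin/nutch to run them.
NUTCH="${NUTCH:-echo}"
CRAWLDB=crawl/crawldb
DEPTH=3

# Seed the single crawldb with the start urls.
$NUTCH inject $CRAWLDB urls

i=1
while [ "$i" -le "$DEPTH" ]; do
  SEGMENT=crawl/segments/seg$i   # placeholder; really a timestamped dir
  $NUTCH generate $CRAWLDB crawl/segments
  $NUTCH fetch "$SEGMENT"
  $NUTCH parse "$SEGMENT"
  # Fold the newly discovered urls back into the one crawldb.
  $NUTCH updatedb $CRAWLDB "$SEGMENT"
  i=$((i + 1))
done
```

Because each step is a separate command, a failure mid-crawl leaves the earlier steps' output intact, which is the recoverability argument made above.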

