On Sun, Mar 27, 2011 at 7:03 PM, Julien Nioche <[email protected]> wrote:
>>> I think it is a good idea to have a script like this, however your proposal
>>> could be improved. It currently works only on a single machine and uses
>>> commands such as mv, ls etc... which won't work on a pseudo or fully
>>> distributed cluster. You should use the 'hadoop fs' commands instead.
>>
>> Okay, let's go for 3 editions:
>> 1. abridged, works only with solr (tersest script)
>> 2. unabridged with local fs - for beginners
>> 3. hadoop unabridged
>
> you don't need to have 2 *and* 3. The hadoop commands will work on the
> local fs in a completely transparent way, it all depends on the way hadoop
> is configured. It isolates the way data are stored (local or distrib) from
> the client code, i.e. Nutch. By adding a separate script using fs, you'd add
> more confusion and lead beginners to think that they HAVE to use fs.

I apologize for not having looked into hadoop in detail yet, but I had understood that it would abstract over the single-machine fs. However, to get up and running right after downloading nutch, will the script just work or will I have to configure hadoop first? I assume the latter. From a beginner's perspective I like to reduce the magic (at first), see through the commands, and get up and running asap. Hence 2. I'll be using 3. (There is a rough sketch of the 'hadoop fs' equivalents I have in mind further down in this mail.)

> As for the legacy-lucene vs SOLR what about having a parameter to determine
> which one should be used and have a single script?

Excellent idea. The default is solr for 1 and 3, but if one passes the parameter 'll' it will use the legacy lucene indexer. The default for 2 is ll, since we want to get up and running fast (before knowing what solr is and setting it up). (See the sketch of such a switch further down.)

>>> If I understand the script correctly, you then merge different crawldbs.
>>> Why do you do that? There should be one crawldb per crawl so I don't think
>>> this is at all necessary.
>>
>> So that I get a single dump with info about all the urls crawled. On the
>> scale of the web this is probably a bad idea, isn't it?
>
> it would be a bad idea even on a medium scale. That sort of works on a
> single machine but as soon as you'd get a bit of data you'd fill the space
> on the disks + the whole thing would take ages.
>
> However the point still is that there should be only one crawldb per crawl
> and it contains all the urls you've injected / discovered
>
>> But then how else could you inspect all the crawled urls at once?
>
> Why do you want to get the info about ALL the urls? There is a readdb
> -stats command which gives a summary of the content of the crawldb. If you
> need to check a particular URL or domain, just use readdb -url and readdb
> -regex (or whatever the name of the param is)

At least when debugging/troubleshooting I found it useful to see which urls were fetched and what the responses were (robot_blocked, etc.). I can do that by examining each $it_crawldb in turn, since I don't know in which iteration a given url was fetched (although since the fetching is pretty linear I could also work it out, something like the url's index in seeds/urls divided by $it_size). Also, I don't know whether hadoop stores a single file on a single machine. My expectation/hope was that the underlying fs loads into memory only the portions of text I'm viewing (I've seen that around), and I don't know how it would handle ctrl+F (maybe with some index). And if it runs into disk space issues, it splits the file up across several machines, transparently.
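For the record, this is roughly how I plan to use readdb for that kind of inspection; the crawl/crawldb path and the url are only examples:

  bin/nutch readdb crawl/crawldb -stats                      # summary of the whole crawldb
  bin/nutch readdb crawl/crawldb -url http://example.com/    # status of one particular url
  bin/nutch readdb crawl/crawldb -dump crawldb_dump          # full text dump, only sensible for small crawls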
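Regarding the 'hadoop fs' commands, just to check I've understood the suggestion, the script would replace the plain shell commands with something along these lines (the paths are only placeholders):

  hadoop fs -ls crawl/segments                     # instead of: ls crawl/segments
  hadoop fs -mv crawl/crawldb crawl/crawldb.old    # instead of: mv crawl/crawldb crawl/crawldb.old
  hadoop fs -rmr crawl/crawldb.old                 # instead of: rm -r crawl/crawldb.old

so the same script should work unchanged whether hadoop is configured for the local fs or for a real cluster.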
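And for the solr vs legacy-lucene parameter, I was thinking of something as simple as this at the indexing step of the script; the 'll' flag, the solr url and the paths are just placeholders of mine, not settled names:

  # 'll' selects the legacy lucene indexer, anything else (or no argument) defaults to solr
  INDEXER="${1:-solr}"
  if [ "$INDEXER" = "ll" ]; then
      bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
  else
      bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
  fi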
>>> Having a script would definitely be a plus for beginners and would give
>>> more flexibility than the crawl command.
>>
>> I'd be the first of those beginners. Crawl is not recommended for whole-web
>> crawling, I guess because it doesn't work incrementally. Why not add such an
>> option to crawl? Shall I file a feature request / patch for that?
>
> IMHO I'd rather have a good and reliable script to replace the Crawl
> command.
>
> Thanks for your contribution BTW

You're welcome. I like Apache's stuff, and thank you for saving me the trouble of re-implementing a search engine atop much more limited frameworks!

> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com

--
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x) < Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email does not contain a valid code then the email is not received. A valid code starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈ L(-[a-z]+[0-9]X)).

