OK, I've hadoopized the script, though I've only tried it locally. On second thought (laziness convinced me), I decided not to include the indexer parameter.
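For concreteness, here is roughly what "hadoopized" means; the helper names and paths below are made up for illustration, not taken from the actual script. With Hadoop's default configuration the `hadoop fs` commands operate on the local filesystem, so the script behaves the same either way:

```shell
# Hypothetical helpers (not from the real script): use hadoop fs when it
# is on the PATH, otherwise fall back to the plain local commands. With
# the default Hadoop config both branches act on the local filesystem.

fs_mv() {
  if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -mv "$1" "$2"     # transparent: local fs or HDFS
  else
    mv "$1" "$2"                # plain local equivalent
  fi
}

fs_rm() {
  if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -rmr "$1"         # -rmr: recursive delete in 0.20-era Hadoop CLIs
  else
    rm -r "$1"
  fi
}
```

The point Julien makes below is exactly this: the client code doesn't need to know whether the data is local or distributed.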
On Mon, Mar 28, 2011 at 10:50 AM, Gabriele Kahlout <[email protected]> wrote:
>
> On Mon, Mar 28, 2011 at 10:43 AM, Julien Nioche <[email protected]> wrote:
>> Hi Gabriele
>>
>>>> you don't need to have 2 *and* 3. The hadoop commands will work on the
>>>> local fs in a completely transparent way; it all depends on the way
>>>> hadoop is configured. It isolates the way data are stored (local or
>>>> distributed) from the client code, i.e. Nutch. By adding a separate
>>>> script using fs, you'd add more confusion and lead beginners to think
>>>> that they HAVE to use fs.
>>>
>>> I apologize for not having yet looked into Hadoop in detail, but I had
>>> understood that it would abstract over the single-machine fs.
>>
>> No problem. It would be worth spending a bit of time reading about Hadoop
>> if you want to get a better understanding of Nutch. Tom White's book is
>> an excellent reference, but the wikis and tutorials would be a good start.
>>
>>> However, to get up and running after downloading Nutch, will the script
>>> just work or will I have to configure Hadoop? I assume the latter.
>>
>> Nope. By default Hadoop uses the local FS. Nutch relies on the Hadoop API
>> for getting its inputs, so when you run it as you did, what actually
>> happens is that you are getting the data from the local FS via Hadoop.
>
> I'll look into it and update the script accordingly.
>
>>> From a beginner's perspective I like to reduce the magic (at first),
>>> see through the commands, and get up and running asap.
>>> Hence 2. I'll be using 3.
>>
>> Hadoop already reduces the magic for you :-)
>
> Okay, if so I'll put the equivalent Unix commands (mv/rm) in the comments
> of the hadoop commands and get rid of 2.
>
>>>> As for legacy Lucene vs. Solr, what about having a parameter to
>>>> determine which one should be used, and have a single script?
>>>
>>> Excellent idea.
>>> The default is solr for 1 and 3, but if one passes the parameter 'll'
>>> it will use legacy Lucene. The default for 2 is ll, since we want to
>>> get up and running fast (before knowing what Solr is and setting it up).
>>
>> It would be nice to have a third possible value (i.e. none) for the
>> -indexer parameter (besides solr and lucene). A lot of people use Nutch
>> as a crawling platform but do not do any indexing.
>
> Agreed. Will add that too.
>
>>>> Why do you want to get the info about ALL the urls? There is a readdb
>>>> -stats command which gives a summary of the content of the crawldb. If
>>>> you need to check a particular URL or domain, just use readdb -url and
>>>> readdb -regex (or whatever the name of the param is).
>>>
>>> At least when debugging/troubleshooting I found it useful to see which
>>> urls were fetched and the responses (robot_blocked, etc.).
>>> I can do that by examining each $it_crawldb in turn, since I don't know
>>> when that url was fetched (although since the fetching is pretty linear
>>> I could also find out, with something like index in seeds/urls /
>>> $it_size).
>>
>> Better to do that by looking at the content of the segments using 'nutch
>> readseg -dump' or using 'hadoop fs -libjars nutch.job
>> segment/SEGMENTNUM/crawl_data', for instance. That's probably not
>> something that most people will want to do, so maybe comment it out in
>> your script?
>>
>> Running Hadoop in pseudo-distributed mode and looking at the Hadoop web
>> GUIs (http://localhost:50030) gives you a lot of information about your
>> crawl.
>>
>> It would definitely be better to have a single crawldb in your script.
>
> Agreed; maybe again an option, with none as the default. But keep every
> $it_crawldb instead of deleting and merging them.
> I should be looking into the necessary Hadoop today and start updating
> the script accordingly.
>> Julien
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x) < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the
> email does not contain a valid code then the email is not received. A
> valid code starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧
> y ∈ L(-[a-z]+[0-9]X)).
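P.P.S. A rough sketch of the -indexer parameter handling agreed above (values solr/ll/none). The function name and the echoed messages are illustrative only; the real script would invoke the corresponding bin/nutch commands instead of echoing:

```shell
# Sketch only: dispatch on the agreed -indexer parameter. The solr URL
# and nutch invocations in the comments are assumptions, not taken from
# the actual script.
run_indexer() {
  case "$1" in
    solr)
      # e.g. bin/nutch solrindex http://localhost:8983/solr crawldb segments/*
      echo "index with solr" ;;
    lucene|ll)
      # e.g. bin/nutch index ...   (legacy Lucene indexer)
      echo "index with legacy lucene" ;;
    none)
      # crawling-only use case Julien mentions: skip indexing entirely
      echo "skip indexing" ;;
    *)
      echo "unknown indexer: $1" >&2; return 1 ;;
  esac
}
```

The default would be solr (ll for option 2, per the thread), with none available for people who use Nutch purely as a crawling platform.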

