Tutorial on incremental crawling

Gabriele Kahlout Mon, 28 Mar 2011 06:50:06 -0700

On Mon, Mar 28, 2011 at 10:43 AM, Julien Nioche <
[email protected]> wrote:


> Hi Gabriele
>
>
>>> you don't need to have 2 *and* 3. The hadoop commands will work on the
>>> local fs in a completely transparent way, it all depends on the way hadoop
>>> is configured. It isolates the way data are stored (local or distrib) from
>>> the client code, i.e. Nutch. By adding a separate script using fs, you'd add
>>> more confusion and lead beginners to think that they HAVE to use fs.
>>>
>>
>> I apologize for not having yet looked into hadoop in detail but I had
>> understood that it would abstract over the single machine fs.
>>
>
> No problems. It would be worth spending a bit of time reading about Hadoop
> if you want to get a better understanding of Nutch. Tom White's book is an
> excellent reference but the wikis and tutorials would be a good start
>
>
>
>> However, to get up and running after downloading nutch will the script
>> just work or will I have to configure hadoop? I assume the latter.
>>
>
> Nope. By default Hadoop uses the local FS. Nutch relies on the Hadoop API
> for getting its inputs, so when you run it as you did what actually happens
> is that you are getting the data from the local FS via Hadoop.
>

I'll look into it and update the script accordingly.

>
>
>> From a beginner's perspective I like to reduce the magic (at first) and see
>> through the commands, and get up and running asap.
>> Hence 2. I'll be using 3.
>>
>
> Hadoop already reduces the magic for you :-)
>
>
Okay, if so I'll put the equivalent unix commands (mv/rm) in the comment of
the hadoop cmds and get rid of 2.
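
A minimal sketch of what such commented commands might look like. The
directory names and the DRY_RUN switch are assumptions made for
illustration, not taken from the actual script. `hadoop fs -mv` and
`hadoop fs -rmr` operate on the local filesystem when Hadoop runs with its
default configuration (newer Hadoop releases spell the latter `-rm -r`).

```shell
#!/bin/sh
# Hypothetical paths; the real script may use different names.
crawldb="crawl/crawldb"
old_crawldb="crawl/crawldb.old"

# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# rotate_crawldb: drop the previous backup, then move the current crawldb aside.
rotate_crawldb() {
  # Unix equivalent: rm -r "$old_crawldb"
  run hadoop fs -rmr "$old_crawldb"
  # Unix equivalent: mv "$crawldb" "$old_crawldb"
  run hadoop fs -mv "$crawldb" "$old_crawldb"
}
```

Setting DRY_RUN=1 before calling rotate_crawldb prints the two hadoop
commands instead of running them, which makes it easy to see what the
script would do.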


>
>>
>>>
>>> As for the legacy-lucene vs SOLR what about having a parameter to
>>> determine which one should be used and have a single script?
>>>
>>>
>> Excellent idea. The default is solr for 1 and 3, but if one passes the
>> parameter 'll' it will use the legacy lucene. The default for 2 is ll since
>> we want to get up and running fast (before knowing what solr is and setting
>> it up).
>>
>
> It would be nice to have a third possible value (i.e. none) for the
> parameter -indexer (besides solr and lucene). A lot of people use Nutch as a
> crawling platform but do not do any indexing
>
Agreed. Will add that too.
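
A rough sketch of how the script might read such a parameter, assuming the
option is spelled `-indexer` and accepts the three values discussed above
(solr, lucene, none) with solr as the default. The function name
parse_indexer is hypothetical.

```shell
#!/bin/sh
# parse_indexer: echo the indexer selected by "-indexer <value>".
# Defaults to solr; other arguments are skipped. Returns 1 on a bad value.
parse_indexer() {
  indexer="solr"
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -indexer)
        if [ "$#" -lt 2 ]; then
          echo "missing value for -indexer" >&2
          return 1
        fi
        indexer="$2"
        shift 2
        ;;
      *)
        shift
        ;;
    esac
  done
  case "$indexer" in
    solr|lucene|none) echo "$indexer" ;;
    *) echo "unknown indexer: $indexer" >&2; return 1 ;;
  esac
}
```

In the script, something like `indexer=$(parse_indexer "$@") || exit 1`
could then pick the indexing step, skipping it entirely when the value is
none.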


>
>>> Why do you want to get the info about ALL the urls? There is a readdb
>>> -stats command which gives a summary of the content of the crawldb. If you
>>> need to check a particular URL or domain, just use readdb -url and readdb
>>> -regex (or whatever the name of the param is)
>>>
>>
>>>
>> At least when debugging/troubleshooting I found it useful to see which
>> urls were fetched and the responses (robot_blocked, etc..).
>> I can do that by examining each $it_crawldb in turn, since I don't know when
>> that url was fetched (although since the fetching is pretty linear I could
>> also find out with something like: index in seeds/urls / $it_size).
>>
>
> better to do that by looking at the content of the segments using 'nutch
> readseg -dump' or using 'hadoop fs -libjars nutch.job
> segment/SEGMENTNUM/crawl_data' for instance. That's probably not something
> that most people will want to do so maybe comment it out in your script?
>
>
> running hadoop in pseudo-distributed mode and looking at the hadoop web GUIs
> (http://localhost:50030) gives you a lot of information about your
> crawl
>
> It would definitely be better to have a single crawldb in your script.
>
>
agreed, maybe again an option and the default is none. But keep every
$it_crawldb instead of deleting and merging them.
I should be looking into the necessary Hadoop today and start updating the
script accordingly.

> Julien
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
Regards,
K. Gabriele

--- unchanged since 20/9/10 ---
P.S. If the subject contains "[LON]" or the addressee acknowledges the
receipt within 48 hours then I don't resend the email.
subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧ time(x)
< Now + 48h) ⇒ ¬resend(I, this).

If an email is sent by a sender that is not a trusted contact or the email
does not contain a valid code then the email is not received. A valid code
starts with a hyphen and ends with "X".
∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
L(-[a-z]+[0-9]X)).

Re: http://wiki.apache.org/nutch/Tutorial%20on%20incremental%20crawling
