Thank you for the explanation. It was a bit confusing at first, but it makes sense now.

Florent

Doğacan Güney wrote:
Hi,

On 5/17/07, Florent Gluck <[EMAIL PROTECTED]> wrote:
Hi all,

I've noticed that when doing a segment dump using readseg, several
instances of the same CrawlDatum can be present in a given record.
For example I have a segment with one single url (http://www.moma.org)
and here is the dump below.  I ran the following command:  nutch readseg
-dump segments/20070517113941 segdump -nocontent -noparsedata -noparsetext

With this command, readseg reads from crawl_{fetch,generate,parse}.


Here is the first record:

Recno:: 0
URL:: http://www.moma.org/

CrawlDatum::
Version: 5
Status: 1 (db_unfetched)
Fetch time: Thu May 17 11:39:34 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: null
Metadata: _ngt_:1179416381663

This one is from crawl_generate; you can see that it contains an _ngt_
field (the generate timestamp). This is the datum the fetcher reads.


CrawlDatum::
Version: 5
Status: 65 (signature)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 0.0 days
Score: 1.0
Signature: fe47b3db7c988541287fc6412ce0b923
Metadata: null

This one is from crawl_parse. It contains the signature of the parsed
text, which is used to deduplicate after indexing.
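The 32-hex-digit signature above looks like an MD5 digest; Nutch's default signature implementation (MD5Signature) hashes the fetched content. Here is a hedged sketch of computing such a digest with the plain JDK — the input string is a placeholder for illustration, not the real moma.org content:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Sketch: compute an MD5 hex digest the way a content signature might
// be derived. Nutch's MD5Signature hashes the raw content bytes; the
// input here is an arbitrary placeholder.
public class Md5Sketch {
    static String md5Hex(byte[] data) throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        StringBuilder sb = new StringBuilder();
        for (byte b : md.digest(data)) {
            sb.append(String.format("%02x", b)); // two lowercase hex chars per byte
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(md5Hex("example content".getBytes("UTF-8")));
    }
}
```

Two identical parse texts produce identical digests, which is what makes the signature usable as a dedup key.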


CrawlDatum::
Version: 5
Status: 33 (fetch_success)
Fetch time: Thu May 17 11:39:49 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.0
Signature: fe47b3db7c988541287fc6412ce0b923
Metadata: _ngt_:1179416381663 _pst_:success(1), lastModified=0


This is from crawl_fetch.
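For reference, the numeric status byte in each dump line maps to the name shown in parentheses next to it. A minimal sketch of that mapping — the values below are taken from the dump output in this thread, not from the Nutch source (the real constants live in org.apache.nutch.crawl.CrawlDatum):

```java
// Illustrative mapping of the CrawlDatum status bytes seen in this
// segment dump to their printed names.
public class StatusNames {
    static String statusName(int status) {
        switch (status) {
            case 1:  return "db_unfetched";  // from crawl_generate
            case 33: return "fetch_success"; // from crawl_fetch
            case 65: return "signature";     // from crawl_parse
            case 67: return "linked";        // outlink discovered at parse time
            default: return "unknown(" + status + ")";
        }
    }

    public static void main(String[] args) {
        System.out.println(33 + " -> " + statusName(33));
    }
}
```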

Why are there three CrawlDatum entries?
I assumed there would be only one CrawlDatum with status 33 (fetch_success).
What is the purpose of the other two?

Now, here is the 5th record:

Recno:: 5
URL:: http://www.moma.org/application/x-shockwave-flash

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

CrawlDatum::
Version: 5
Status: 67 (linked)
Fetch time: Thu May 17 11:39:51 EDT 2007
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.03846154
Signature: null
Metadata: null

In this case, a linked status indicates an outlink. Most likely your
page (http://www.moma.org) contains six distinct links to
http://www.moma.org/application/x-shockwave-flash. Each of them is
written as a separate entry to crawl_parse. These entries are used by
updatedb to (among other things) calculate scores.
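The per-outlink score of 0.03846154 is consistent with OPIC-style scoring, where the parent's score (1.0 here) is divided evenly among its outlinks: 1.0 / 26 ≈ 0.03846154, suggesting the page had 26 outlinks in total. A quick check of that arithmetic — note the outlink count of 26 is inferred from the score, not taken from the dump:

```java
// Sanity check: if the parent page's score of 1.0f is split evenly
// across 26 outlinks (an inferred count), each outlink receives the
// 0.03846154 seen in the dump above.
public class ScoreSplit {
    static float perOutlinkScore(float parentScore, int outlinkCount) {
        return parentScore / outlinkCount;
    }

    public static void main(String[] args) {
        System.out.println(perOutlinkScore(1.0f, 26));
    }
}
```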



There are six CrawlDatum entries, and all of them are exactly identical.
Is this a bug, or am I missing something here?

Any light on this matter would be greatly appreciated.
Thank you.

Florent



