Thank you for the explanation. It was a bit confusing at first, but it actually makes sense.
Florent

Doğacan Güney wrote:
> Hi,
>
> On 5/17/07, Florent Gluck <[EMAIL PROTECTED]> wrote:
>> Hi all,
>>
>> I've noticed that when doing a segment dump using readseg, several
>> instances of the same CrawlDatum can be present in a given record.
>> For example I have a segment with one single url (http://www.moma.org)
>> and here is the dump below. I ran the following command: nutch readseg
>> -dump segments/20070517113941 segdump -nocontent -noparsedata
>> -noparsetext
>
> With this command, readseg reads from crawl_{fetch,generate,parse}.
>
>>
>> Here is the first record:
>>
>> Recno:: 0
>> URL:: http://www.moma.org/
>>
>> CrawlDatum::
>> Version: 5
>> Status: 1 (db_unfetched)
>> Fetch time: Thu May 17 11:39:34 EDT 2007
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 1.0
>> Signature: null
>> Metadata: _ngt_:1179416381663
>
> This one is from crawl_generate, you can see that it contains a _ngt_
> field. This datum is read by fetcher.
>
>>
>> CrawlDatum::
>> Version: 5
>> Status: 65 (signature)
>> Fetch time: Thu May 17 11:39:51 EDT 2007
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 0.0 days
>> Score: 1.0
>> Signature: fe47b3db7c988541287fc6412ce0b923
>> Metadata: null
>
> This one is from crawl_parse. It contains signature of the parse text
> which is used to dedup after index.
>
>>
>> CrawlDatum::
>> Version: 5
>> Status: 33 (fetch_success)
>> Fetch time: Thu May 17 11:39:49 EDT 2007
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 1.0
>> Signature: fe47b3db7c988541287fc6412ce0b923
>> Metadata: _ngt_:1179416381663 _pst_:success(1), lastModified=0
>>
>
> This is from crawl_fetch.
>
>> Why are there 3 CrawlDatum fields?
>> I assumed there would be only one CrawlDatum with status 33
>> (fetch_success).
>> What is the purpose of the other two?
>>
>> Now, here is the 5th record:
>>
>> Recno:: 5
>> URL:: http://www.moma.org/application/x-shockwave-flash
>>
>> CrawlDatum::
>> Version: 5
>> Status: 67 (linked)
>> Fetch time: Thu May 17 11:39:51 EDT 2007
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 0.03846154
>> Signature: null
>> Metadata: null
>>
>> CrawlDatum::
>> Version: 5
>> Status: 67 (linked)
>> Fetch time: Thu May 17 11:39:51 EDT 2007
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 0.03846154
>> Signature: null
>> Metadata: null
>>
>> CrawlDatum::
>> Version: 5
>> Status: 67 (linked)
>> Fetch time: Thu May 17 11:39:51 EDT 2007
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 0.03846154
>> Signature: null
>> Metadata: null
>>
>> CrawlDatum::
>> Version: 5
>> Status: 67 (linked)
>> Fetch time: Thu May 17 11:39:51 EDT 2007
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 0.03846154
>> Signature: null
>> Metadata: null
>>
>> CrawlDatum::
>> Version: 5
>> Status: 67 (linked)
>> Fetch time: Thu May 17 11:39:51 EDT 2007
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 0.03846154
>> Signature: null
>> Metadata: null
>>
>> CrawlDatum::
>> Version: 5
>> Status: 67 (linked)
>> Fetch time: Thu May 17 11:39:51 EDT 2007
>> Modified time: Wed Dec 31 19:00:00 EST 1969
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 0.03846154
>> Signature: null
>> Metadata: null
>
> In this case, a linked status indicates an outlink. Most likely your
> url (http://www.moma.org) contains six distinct outlinks to
> http://www.moma.org/application/x-shockwave-flash. Each of them is put
> as a separate entity to crawl_parse. This is used in updatedb to
> (among other things) calculate score.
>
>>
>>
>> There are 6 CrawlDatum fields and all of them are exactly identical.
>> Is this a bug or am I missing something here?
>>
>> Any light on this matter would be greatly appreciated.
>> Thank you.
>>
>> Florent
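
The attribution given in the quoted reply (which segment subdirectory each CrawlDatum comes from) can be checked against a dump with a short script. Below is a minimal Python sketch, not part of Nutch: the `summarize` helper, the `SOURCE_BY_STATUS` table, and the `SAMPLE` text are hypothetical names introduced here, the table covers only the four statuses that appear in this thread, and the parser assumes the dump prints one field per line as in the records quoted above.

```python
import re
from collections import Counter, defaultdict

# Status names as they appear in the dump quoted in this thread, mapped to the
# segment subdirectory the reply attributes them to. Assumption: this table is
# limited to the four statuses discussed here and is not a complete list.
SOURCE_BY_STATUS = {
    "db_unfetched": "crawl_generate",
    "fetch_success": "crawl_fetch",
    "signature": "crawl_parse",
    "linked": "crawl_parse",
}

URL_RE = re.compile(r"^URL::\s*(\S+)")
STATUS_RE = re.compile(r"^Status:\s*\d+\s*\((\w+)\)")


def summarize(dump_text):
    """Return {url: Counter keyed by (status name, assumed source directory)}."""
    summary = defaultdict(Counter)
    url = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        match = URL_RE.match(line)
        if match:
            # A new record starts; later Status lines belong to this URL.
            url = match.group(1)
            continue
        match = STATUS_RE.match(line)
        if match and url is not None:
            name = match.group(1)
            source = SOURCE_BY_STATUS.get(name, "unknown")
            summary[url][(name, source)] += 1
    return dict(summary)


# Abbreviated sample modeled on the two records in the thread (fields other
# than Version and Status are omitted, and only two linked datums are kept).
SAMPLE = """\
Recno:: 0
URL:: http://www.moma.org/

CrawlDatum::
Version: 5
Status: 1 (db_unfetched)

CrawlDatum::
Version: 5
Status: 65 (signature)

CrawlDatum::
Version: 5
Status: 33 (fetch_success)

Recno:: 5
URL:: http://www.moma.org/application/x-shockwave-flash

CrawlDatum::
Version: 5
Status: 67 (linked)

CrawlDatum::
Version: 5
Status: 67 (linked)
"""

if __name__ == "__main__":
    for url, counts in summarize(SAMPLE).items():
        print(url)
        for (name, source), n in sorted(counts.items()):
            print(f"  {n} x {name} (likely from {source})")
```

On the abbreviated sample this reports three datums for http://www.moma.org/ (one each attributed to crawl_generate, crawl_parse, and crawl_fetch) and two linked datums for the Flash URL. Run against a full dump, a count of N linked entries for a URL would correspond to N outlink datums written to crawl_parse, as the reply describes.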
