Re: CrawlDatum: mislabeling?

Andrzej Bialecki Thu, 10 Apr 2008 01:39:46 -0700

[EMAIL PROTECTED] wrote:

Hi,


Does "Fetch time" in CrawlDatum really represent "Next fetch time"?

Example:
The URL below was just fetched.  After that bin/nutch readdb was run:

$ bin/nutch readdb /user/foo/crawl/crawldb -url http://www.foobar.com/

URL: http://www.foobar.com/
Version: 6
Status: 6 (db_notmodified)
Fetch time: Fri May 09 17:17:31 EDT 2008          <---- NOTE: 30 days from now??
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 3.955374E-8
Signature: f3ee31dcfde9ca40f4ed4a4e1bf66e24
Metadata: _pst_:temp_moved(13), lastModified=0: http://foobar.com/

Either the above "Fetch time" is off by 1 month, or it should really
be labeled "Next fetch time".
Looking at CrawlDatum, it looks like it's the latter.  Is that so?

Well, this field serves two purposes, so the name is ambiguous on purpose (and that's probably bad ;) ). The CrawlDatum class is used in many contexts: it's used to keep the (static) status of pages in CrawlDb, but it's also used during fetching / updating jobs to keep track of the current (changing) status of pages as they are being fetched. E.g. fetchers will update this field to contain the actual fetch time (so it no longer carries the meaning "next fetch time" in that case; instead its value is equal to the actual fetch time when the page was fetched). On the other hand, the CrawlDbReducer modifies this value to set the time of the next fetch, and as such it's recorded in the CrawlDb ...
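
To make the two meanings concrete, here is a minimal sketch in plain Java. It is not Nutch code: PageRecord, markFetched and scheduleNext are hypothetical names made up for illustration, and the only value taken from the readdb output above is the 2592000 second retry interval. The fetch instant (Apr 09 17:17:31 EDT 2008, i.e. 21:17:31 UTC) is an assumption, obtained by subtracting 30 days from the "Fetch time" shown above.

```java
import java.time.Instant;

public class FetchTimeSketch {

    // Hypothetical stand-in for a crawl record; NOT Nutch's CrawlDatum.
    static final class PageRecord {
        long fetchTimeMs;            // one field, two meanings (see below)
        final int retryIntervalSec;  // e.g. 2592000 seconds = 30 days

        PageRecord(int retryIntervalSec) {
            this.retryIntervalSec = retryIntervalSec;
        }
    }

    // Fetch phase: the field holds the actual time the page was fetched.
    static void markFetched(PageRecord r, long nowMs) {
        r.fetchTimeMs = nowMs;
    }

    // Update phase: the same field is pushed forward by the retry interval,
    // so what ends up stored is the time of the NEXT fetch.
    static void scheduleNext(PageRecord r) {
        r.fetchTimeMs += r.retryIntervalSec * 1000L;
    }

    public static void main(String[] args) {
        PageRecord r = new PageRecord(2592000);

        // Assumed fetch instant: Apr 09 17:17:31 EDT 2008 == 21:17:31 UTC.
        long fetchedAt = Instant.parse("2008-04-09T21:17:31Z").toEpochMilli();

        markFetched(r, fetchedAt);
        System.out.println("after fetch:  " + Instant.ofEpochMilli(r.fetchTimeMs));
        // prints: after fetch:  2008-04-09T21:17:31Z

        scheduleNext(r);
        System.out.println("after update: " + Instant.ofEpochMilli(r.fetchTimeMs));
        // prints: after update: 2008-05-09T21:17:31Z
    }
}
```

Under these assumptions the second value corresponds to Fri May 09 17:17:31 EDT 2008, which matches the "Fetch time" that readdb printed: what is stored after the update step is the next fetch time, not the last one.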


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
