[EMAIL PROTECTED] wrote:
Hi,
Does "Fetch time" in CrawlDatum really represent "Next fetch time"?
Example:
The URL below was just fetched. After that bin/nutch readdb was run:
$ bin/nutch readdb /user/foo/crawl/crawldb -url http://www.foobar.com/
URL: http://www.foobar.com/
Version: 6
Status: 6 (db_notmodified)
Fetch time: Fri May 09 17:17:31 EDT 2008 <---- NOTE: 30 days from now??
Modified time: Wed Dec 31 19:00:00 EST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 3.955374E-8
Signature: f3ee31dcfde9ca40f4ed4a4e1bf66e24
Metadata: _pst_:temp_moved(13), lastModified=0: http://foobar.com/
Either the above "Fetch time" is off by 1 month, or the above "Fetch time" should really
be labeled "Next fetch time".
Looking at CrawlDatum, it looks like it's the latter. Is that so?
Well, this field serves two purposes, so the name is ambiguous on
purpose (and that's probably bad ;) ). CrawlDatum class is used in many
contexts, it's used to keep the (static) status of pages in CrawlDb, but
it's also used during fetching / updating jobs to keep track of the
current (changing) status of pages as they are being fetched. E.g.
fetchers will update this field to contain the actual fetch time (so it
no longer carries the meaning "next fetch time" in that case - instead
its value is equal to the actual fetch time when the page was fetched).
On the other hand, the CrawlDbReducer modifies this value to set the
time of the next fetch, and as such it's recorded in the CrawlDb ...
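
To make the two meanings concrete, here is a minimal sketch of one field being reused across the two phases. It is not Nutch code: the Datum class and the markFetched / scheduleNext methods are hypothetical names made up for illustration, and the "fetched at" timestamp is only an assumption, obtained by subtracting the 30-day retry interval from the "Fetch time" in the readdb output above.

```java
import java.time.Duration;
import java.time.Instant;

public class FetchTimeSketch {

    /** Hypothetical stand-in for a CrawlDb entry; this is NOT Nutch's CrawlDatum. */
    static final class Datum {
        Instant fetchTime;             // meaning depends on the phase, see below
        final Duration retryInterval;  // e.g. 2592000 seconds (30 days)

        Datum(Instant fetchTime, Duration retryInterval) {
            this.fetchTime = fetchTime;
            this.retryInterval = retryInterval;
        }
    }

    /** Fetch phase: the field records when the page was actually fetched. */
    static void markFetched(Datum d, Instant now) {
        d.fetchTime = now;
    }

    /** Update phase: the same field is moved forward to the next scheduled fetch. */
    static void scheduleNext(Datum d) {
        d.fetchTime = d.fetchTime.plus(d.retryInterval);
    }

    public static void main(String[] args) {
        Datum d = new Datum(Instant.EPOCH, Duration.ofSeconds(2_592_000));

        // Assumed actual fetch moment: Apr 09 2008 17:17:31 EDT, expressed in UTC.
        Instant fetchedAt = Instant.parse("2008-04-09T21:17:31Z");

        markFetched(d, fetchedAt);
        System.out.println("during fetch:  fetchTime = " + d.fetchTime);

        scheduleNext(d);
        System.out.println("after update:  fetchTime = " + d.fetchTime);
    }
}
```

Under these assumptions the value printed after the update step is exactly 30 days later than the one printed during the fetch, which would match what readdb reports from the CrawlDb as "Fetch time".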
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com