Armel T. Nene wrote:
Hi guys,

I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually
save the last modified date of files. I have run a crawl on my local file
system and on the web. When I dumped the content of the crawldb for both crawls,
the modified date of every file was set to 01-Jan-1970 01:00:00. I don't know
if this is intended or if it's a bug. Therefore my question is:

* How does the generator know which files to crawl again?
  o Is it looking at the fetch time?
  o Or at the modified date, which can be misleading?

Most HTTP responses include a last-modified date in their headers, and every
file on a file system has a last-modified date as well. Why is it not stored
in the crawldb?


This is the issue described in NUTCH-61. Patches from that issue will be
applied to trunk/ soon.
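
As a side note on the value in the dump: 01-Jan-1970 01:00:00 is what a
timestamp of 0 milliseconds looks like when printed in a UTC+1 time zone, so
it most likely means the modified time was never set rather than set to a
real date. The sketch below illustrates this with plain JDK classes only; it
does not use any Nutch API. The class name EpochCheck, the helper names, and
the UTC+1 offset are assumptions made for illustration.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, not part of Nutch: shows how an unset (zero)
// modified time is rendered in a dump.
public class EpochCheck {
    // Same layout as the date quoted in the question.
    static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("dd-MMM-yyyy HH:mm:ss", Locale.ENGLISH);

    // Format a millisecond timestamp at a fixed UTC offset.
    static String format(long epochMillis, ZoneOffset offset) {
        return FMT.format(Instant.ofEpochMilli(epochMillis).atOffset(offset));
    }

    // A value of 0 is treated here as "never set" (an assumption).
    static boolean looksUnset(long modifiedMillis) {
        return modifiedMillis == 0L;
    }

    public static void main(String[] args) {
        // Assumes the machine that produced the dump was at UTC+1.
        System.out.println(format(0L, ZoneOffset.ofHours(1)));
    }
}
```

If the dumped dates all match this value, the field is simply empty, which is
consistent with the modified time not being recorded yet.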

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

