Hi guys,
I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually
save the last modified date of files. I have run a crawl on my local file
system and on the web. When I dumped the contents of the crawldb for both
crawls, the modified date of every file was set to 01-Jan-1970 01:00:00. I
don't know whether this is intended or a bug. Therefore my question is:
* How does the generator know which files to crawl again?
o Is it looking at the fetch time?
o Or is it looking at the modified date? That could be misleading if the
date is never set.
Most HTTP responses include a Last-Modified header, and every file on a file
system has a last modified date. Why is this date not stored in the crawldb?
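
To show what I mean by the two sources of a modified date, here is a minimal
sketch in plain Java. It is not Nutch code and does not use any Nutch API; the
class and method names (ModifiedTimeSketch, fileModifiedMillis,
httpModifiedMillis) are made up for illustration, and the header value is just
an example string in the standard HTTP date format.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Nutch: shows where a modified date can come from.
class ModifiedTimeSketch {

    // Last modified time of a local file, in milliseconds since the epoch.
    static long fileModifiedMillis(Path file) throws IOException {
        return Files.getLastModifiedTime(file).toMillis();
    }

    // Parse the value of an HTTP Last-Modified header (RFC 1123 date format).
    static long httpModifiedMillis(String lastModifiedHeader) {
        return ZonedDateTime
                .parse(lastModifiedHeader, DateTimeFormatter.RFC_1123_DATE_TIME)
                .toInstant()
                .toEpochMilli();
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for a crawled document on the local file system.
        Path tmp = Files.createTempFile("modified-sketch", ".xml");
        try {
            System.out.println("file: " + Instant.ofEpochMilli(fileModifiedMillis(tmp)));
        } finally {
            Files.deleteIfExists(tmp);
        }

        // Example header value only; a real crawl would read it from the response.
        long http = httpModifiedMillis("Thu, 22 Feb 2007 12:45:43 GMT");
        System.out.println("http: " + Instant.ofEpochMilli(http));
    }
}
```

Both values are millisecond timestamps, so I would have expected either one to
fit in the modified time field of a crawldb entry.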
Here is an extract from my 2 crawls:
http://dmoz.org/Arts/
Version: 4
Status: 2 (DB_fetched)
Fetch time: Thu Feb 22 12:45:43 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.013471641
Signature: fe52a0bcb1071070689d0f661c168648
Metadata: null
file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_00000121.xml
Version: 4
Status: 2 (DB_fetched)
Fetch time: Sat Feb 24 10:31:44 GMT 2007
Modified time: Thu Jan 01 01:00:00 GMT 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 1.1035091E-4
Signature: 57254d9ca2988ce1bf7f92b6239d6ebc
Metadata: null
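
One observation on the two entries above: the modified time shown is what a
timestamp of 0 milliseconds looks like when it is rendered in UK local time
(the UK was on UTC+1 all year in 1970). My guess, which I have not confirmed in
the Nutch source, is therefore that the field is never set rather than set to a
wrong date. A small check in plain Java, with a hypothetical class name
(EpochZeroSketch) and the time zone "Europe/London" assumed from my machine's
settings:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical check, not Nutch code: how does an unset (zero) timestamp render?
class EpochZeroSketch {

    // Render a millisecond timestamp in the given time zone.
    static ZonedDateTime render(long epochMillis, String zoneId) {
        return Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zoneId));
    }

    public static void main(String[] args) {
        // 0 ms since the epoch, shown in UK local time and in UTC for comparison.
        System.out.println(render(0L, "Europe/London"));
        System.out.println(render(0L, "UTC"));
    }
}
```

If that reading is right, the 01:00:00 is only a time zone effect and the
stored value is simply zero.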
Looking forward to your reply.
Regards,
Armel
-------------------------------------------------
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483
http://blog.idna-solutions.com
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers