Armel T. Nene wrote:
> Hi guys,
>
> I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually
> save the last-modified date of files. I have run a crawl on my local file
> system and on the web. When I dumped the contents of the crawldb for both
> crawls, the modified date of every file was set to 01-Jan-1970 01:00:00. I
> don't know if this is intended or if it's a bug. Therefore my question is:
>
> * How does the generator know which files to crawl again?
>   o Is it looking at the fetch time?
>   o Or at the modified date, which can be misleading?
>
> There is a modified date returned in most HTTP headers, and files on the
> file system all have a last-modified date. How come it's not stored in
> the crawldb?

This is the issue described in NUTCH-61 - patches from that issue will
be applied to trunk/ soon.
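
As a side note on the value in the dump: 01-Jan-1970 01:00:00 is what a
timestamp of zero looks like when printed in a UTC+1 timezone, which
suggests the modified time was never set rather than set to a wrong date.
A minimal Python sketch of that arithmetic, assuming the unset value is 0
milliseconds since the Unix epoch (the function and constant names are
hypothetical and are not Nutch code):

```python
from datetime import datetime, timedelta, timezone

# Assumption: an unset modified time is stored as 0 ms since the Unix epoch.
UNSET_MODIFIED_MS = 0


def format_modified(millis, utc_offset_hours):
    """Render epoch milliseconds as dd-Mon-yyyy HH:MM:SS at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    moment = datetime.fromtimestamp(millis / 1000, tz=tz)
    return moment.strftime("%d-%b-%Y %H:%M:%S")


# An unset timestamp viewed from a UTC+1 timezone.
print(format_modified(UNSET_MODIFIED_MS, 1))
```

With an offset of 0 instead of 1 the same call gives midnight, so the
01:00:00 in the dump only reflects the local timezone of the machine that
produced it.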

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
