Re: Modified date in crawldb

chee wu Thu, 25 Jan 2007 17:16:28 -0800

Armel,
   Sorry, I haven't tried this patch yet.

----- Original Message ----- 
From: "Armel T. Nene" <[EMAIL PROTECTED]>
To: <nutch-dev@lucene.apache.org>
Sent: Thursday, January 25, 2007 11:07 PM
Subject: RE: Modified date in crawldb



> Chee,
> 
> Have you successfully applied NUTCH-61 to Nutch 0.8.1? I worked on that
> version and was able to apply the patch fully, but I was not entirely
> successful in running it with the XML parser plugin. If you have applied it
> successfully, let me know.
> 
> Regards,
> 
> Armel 
> -------------------------------------------------
> Armel T. Nene
> iDNA Solutions
> Tel: +44 (207) 257 6124
> Mobile: +44 (788) 695 0483 
> http://blog.idna-solutions.com
> 
> -----Original Message-----
> From: chee wu [mailto:[EMAIL PROTECTED] 
> Sent: 25 January 2007 13:44
> To: nutch-dev@lucene.apache.org
> Subject: Re: Modified date in crawldb
> 
> I also had this question a few days ago, and I am using Nutch 0.8.1. It seems
> the "Modified date" will be used by NUTCH-61; you can find details at the
> link below:
> http://issues.apache.org/jira/browse/NUTCH-61
> 
> I haven't studied this JIRA issue, and just wrote a simple function to do
> this:
> 1. Retrieve all the date information contained in the page content, using a
> regular expression to identify the dates.
> 2. Choose the newest date found as the page's modified date.
> 3. Call the setModifiedTime() method of the CrawlDatum object in
> FetcherThread.output().
> Maybe you can use a parse filter to separate this function from the core
> code.
> I am also new to Nutch; if anything is wrong, please feel free to point it out.
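
The first two steps above can be sketched in plain Java, with no Nutch classes involved. This is only an illustration under assumptions: the class and method names (ModifiedDateGuess, newestDate, toEpochMillis) are hypothetical, only ISO-style dates such as 2007-01-25 are recognised, and it uses java.time, which is newer than the Java versions Nutch 0.8.x targeted. Step 3 is not shown; the resulting millisecond value is what one would presumably hand to setModifiedTime() on the CrawlDatum, assuming that method takes epoch milliseconds.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ModifiedDateGuess {
    // Assumption: only ISO-style dates (yyyy-MM-dd) are looked for.
    // Real pages would need additional patterns for other date formats.
    private static final Pattern ISO_DATE =
            Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");

    /** Steps 1 and 2: find every date in the text and keep the newest one. */
    public static Optional<LocalDate> newestDate(String text) {
        LocalDate newest = null;
        Matcher m = ISO_DATE.matcher(text);
        while (m.find()) {
            try {
                LocalDate d = LocalDate.parse(m.group());
                if (newest == null || d.isAfter(newest)) {
                    newest = d;
                }
            } catch (DateTimeParseException e) {
                // Matched the regex but is not a real date (e.g. 2007-13-45); skip it.
            }
        }
        return Optional.ofNullable(newest);
    }

    /** Converts a date to epoch milliseconds at UTC midnight. */
    public static long toEpochMillis(LocalDate d) {
        return d.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        String page = "Created 2006-11-02, revised 2007-01-25. Build 2005-03-14.";
        newestDate(page).ifPresent(
                d -> System.out.println(d + " -> " + toEpochMillis(d)));
    }
}
```

Note that the newest date mentioned in a page is only a guess at its modification time: a page that mentions a future event would be dated in the future.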
> 
> 
> ----- Original Message ----- 
> From: "Armel T. Nene" <[EMAIL PROTECTED]>
> To: <nutch-dev@lucene.apache.org>
> Sent: Thursday, January 25, 2007 7:52 PM
> Subject: Modified date in crawldb
> 
> 
>> Hi guys,
>> 
>> 
>> 
>> I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not actually
>> save the last modified date of files. I have run a crawl on my local file
>> system and on the web. When I dumped the content of the crawldb for both
>> crawls, the modified date of the files was set to 01-Jan-1970 01:00:00. I
>> don't know if it's intended to be as is or if it's a bug. Therefore my
>> question is:
>> 
>> 
>> 
>> * How does the generator know which files to crawl again?
>>   o Is it looking at the fetch time?
>>   o Or at the modified date, which can be misleading?
>> 
>> 
>> 
>> There is a modified date returned in most HTTP headers, and files on a file
>> system all have a last-modified date. How come it's not stored in the
>> crawldb?
>> 
>> 
>> 
>> Here is an extract from my 2 crawls:
>> 
>> 
>> 
>> http://dmoz.org/Arts/   Version: 4
>> Status: 2 (DB_fetched)
>> Fetch time: Thu Feb 22 12:45:43 GMT 2007
>> Modified time: Thu Jan 01 01:00:00 GMT 1970
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 0.013471641
>> Signature: fe52a0bcb1071070689d0f661c168648
>> Metadata: null
>> 
>> file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_00000121.xml   Version: 4
>> Status: 2 (DB_fetched)
>> Fetch time: Sat Feb 24 10:31:44 GMT 2007
>> Modified time: Thu Jan 01 01:00:00 GMT 1970
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 1.1035091E-4
>> Signature: 57254d9ca2988ce1bf7f92b6239d6ebc
>> Metadata: null
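
A modified time of 01 Jan 1970 01:00:00 is what a timestamp of zero milliseconds looks like when it is displayed in the Europe/London zone, because the UK was on UTC+1 throughout 1970. That suggests the field was simply never set, rather than set to a wrong value. The small sketch below shows this with plain Java; the zone is an assumption about the machine that produced the dump, and the class name EpochZero is hypothetical.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class EpochZero {
    /** Renders epoch millisecond 0 as a date-time in the given zone. */
    public static ZonedDateTime epochIn(String zoneId) {
        return Instant.ofEpochMilli(0L).atZone(ZoneId.of(zoneId));
    }

    public static void main(String[] args) {
        // Midnight in UTC, but one hour later on a London clock in 1970.
        System.out.println(epochIn("UTC"));
        System.out.println(epochIn("Europe/London"));
    }
}
```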
>> 
>> 
>> 
>> Looking forward to your reply.
>> 
>> 
>> 
>> Regards,
>> 
>> Armel
>> 
>> -------------------------------------------------
>> Armel T. Nene
>> iDNA Solutions
>> Tel: +44 (207) 257 6124
>> Mobile: +44 (788) 695 0483
>> http://blog.idna-solutions.com
>> 
>> 
>> 
>>
> 
>
