Armel,

Sorry, I haven't tried this patch yet.

----- Original Message -----
From: "Armel T. Nene" <[EMAIL PROTECTED]>
To: <nutch-dev@lucene.apache.org>
Sent: Thursday, January 25, 2007 11:07 PM
Subject: RE: Modified date in crawldb

> Chee,
>
> Have you successfully applied NUTCH-61 to Nutch 0.8.1? I worked on that
> version and was able to apply the patch fully, but was not entirely
> successful in running it with the XML parser plugin. If you have applied
> it successfully, let me know.
>
> Regards,
>
> Armel
> -------------------------------------------------
> Armel T. Nene
> iDNA Solutions
> Tel: +44 (207) 257 6124
> Mobile: +44 (788) 695 0483
> http://blog.idna-solutions.com
>
> -----Original Message-----
> From: chee wu [mailto:[EMAIL PROTECTED]
> Sent: 25 January 2007 13:44
> To: nutch-dev@lucene.apache.org
> Subject: Re: Modified date in crawldb
>
> I also had this question a few days ago, and I am using Nutch 0.8.1. It
> seems the "Modified date" will be used by NUTCH-61; you can find details
> at the link below:
> http://issues.apache.org/jira/browse/NUTCH-61
>
> I haven't studied this JIRA, and just wrote a simple function to fulfill
> this:
> 1. Retrieve all the date information contained in the page content; a
>    regular expression is used to identify the dates.
> 2. Choose the newest date found as the page's modified date.
> 3. Call the "setModifiedTime( )" method of the CrawlDatum object in
>    FetcherThread.Output( ).
> Maybe you can use a parse filter to separate this function from the core
> code.
> I am also new to Nutch; if anything is wrong, please feel free to point
> it out.
>
>
> ----- Original Message -----
> From: "Armel T. Nene" <[EMAIL PROTECTED]>
> To: <nutch-dev@lucene.apache.org>
> Sent: Thursday, January 25, 2007 7:52 PM
> Subject: Modified date in crawldb
>
>
>> Hi guys,
>>
>> I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not
>> actually save the last modified date of files. I have run a crawl on my
>> local file system and the web. When I dumped the content of the crawldb
>> for both crawls, the modified date of the files was set to
>> 01-Jan-1970 01:00:00. I don't know if it's intended to be as is or if
>> it's a bug.
>> Therefore my question is:
>>
>> * How does the generator know which files to crawl again?
>>   o Is it looking at the fetch time?
>>   o Is it looking at the modified date, as this can be misleading?
>>
>> There is a modified date returned in most HTTP headers, and files on the
>> file system all have a last modified date. How come it's not stored in
>> the crawldb?
>>
>> Here is an extract from my 2 crawls:
>>
>> http://dmoz.org/Arts/ Version: 4
>> Status: 2 (DB_fetched)
>> Fetch time: Thu Feb 22 12:45:43 GMT 2007
>> Modified time: Thu Jan 01 01:00:00 GMT 1970
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 0.013471641
>> Signature: fe52a0bcb1071070689d0f661c168648
>> Metadata: null
>>
>> file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_00000121.xml
>> Version: 4
>> Status: 2 (DB_fetched)
>> Fetch time: Sat Feb 24 10:31:44 GMT 2007
>> Modified time: Thu Jan 01 01:00:00 GMT 1970
>> Retries since fetch: 0
>> Retry interval: 30.0 days
>> Score: 1.1035091E-4
>> Signature: 57254d9ca2988ce1bf7f92b6239d6ebc
>> Metadata: null
>>
>> Looking forward to your reply.
>>
>> Regards,
>>
>> Armel
>>
>> -------------------------------------------------
>> Armel T. Nene
>> iDNA Solutions
>> Tel: +44 (207) 257 6124
>> Mobile: +44 (788) 695 0483
>> http://blog.idna-solutions.com
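
The three steps Chee describes in the quoted reply (collect the dates in the page content with a regular expression, keep the newest one, hand it to setModifiedTime( )) could be sketched roughly as below. This is only a minimal sketch under stated assumptions, not Chee's actual function and not the NUTCH-61 patch: the class name ModifiedDateGuess, the method newestDateMillis, and the restriction to ISO-style yyyy-MM-dd dates are all made up for illustration, and java.time is newer than the Java that Nutch 0.8.x ran on.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class ModifiedDateGuess {

    // Assumption: only ISO-style dates (yyyy-MM-dd) are recognised.
    // A real page would need several patterns for other date formats.
    private static final Pattern ISO_DATE =
            Pattern.compile("\\b(\\d{4}-\\d{2}-\\d{2})\\b");

    /**
     * Steps 1 and 2: find every date in the content and keep the newest.
     * Returns epoch milliseconds at UTC midnight of that date, or an
     * empty Optional if no valid date was found.
     */
    static Optional<Long> newestDateMillis(String content) {
        Matcher m = ISO_DATE.matcher(content);
        LocalDate newest = null;
        while (m.find()) {
            try {
                LocalDate candidate = LocalDate.parse(m.group(1));
                if (newest == null || candidate.isAfter(newest)) {
                    newest = candidate;
                }
            } catch (DateTimeParseException e) {
                // Looks like a date but is not one (e.g. 2007-13-45); skip it.
            }
        }
        if (newest == null) {
            return Optional.empty();
        }
        return Optional.of(
                newest.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli());
    }

    public static void main(String[] args) {
        String page = "Posted 2006-11-03, updated 2007-01-25. Copyright 2005-01-01.";
        // Step 3 would pass this value to the setModifiedTime( ) call
        // mentioned in the thread; here it is only printed.
        System.out.println(newestDateMillis(page));
    }
}
```

Picking the newest date in the text is a heuristic: a page that mentions a future date, or a date unrelated to its own editing history, would get a misleading modified time, which is one reason to prefer the HTTP or file system modified date when it is available.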