Chee,

Have you successfully applied NUTCH-61 to Nutch 0.8.1? I worked on that
version and was able to apply the patch fully, but was not entirely successful
in running it with the XML parser plugin. If you have applied it successfully,
let me know.

Regards,

Armel 
-------------------------------------------------
Armel T. Nene
iDNA Solutions
Tel: +44 (207) 257 6124
Mobile: +44 (788) 695 0483 
http://blog.idna-solutions.com

-----Original Message-----
From: chee wu [mailto:[EMAIL PROTECTED] 
Sent: 25 January 2007 13:44
To: nutch-dev@lucene.apache.org
Subject: Re: Modified date in crawldb

I also had this question a few days ago, and I am using Nutch 0.8.1. It seems
the "Modified date" will be used by NUTCH-61; you can find details at the
link below:
 http://issues.apache.org/jira/browse/NUTCH-61

I haven't studied this JIRA issue, and just wrote a simple function to do
this:
1. Retrieve all the date information contained in the page content; a regular
expression is used to identify the dates.
2. Choose the newest date found as the page's modified date.
3. Call the "setModifiedTime( )" method of the CrawlDatum object in
FetcherThread.output( ).
Maybe you can use a parse filter to separate this function from the core
code.
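
Steps 1 and 2 above could be sketched roughly as follows. This is only a
minimal illustration, not the function I actually wrote: the class name
ModifiedDateGuesser and its methods are hypothetical, and it assumes dates
appear in the page text as yyyy-MM-dd, which real pages often will not. The
resulting millisecond value is what step 3 would pass to setModifiedTime( ).

```java
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
final class ModifiedDateGuesser {

    // Assumption: only ISO-style dates (yyyy-MM-dd) are recognized.
    private static final Pattern ISO_DATE =
            Pattern.compile("\\b(\\d{4}-\\d{2}-\\d{2})\\b");

    // Step 1 and 2: scan the content for dates and keep the newest one.
    static Optional<LocalDate> newestDate(String content) {
        LocalDate newest = null;
        Matcher m = ISO_DATE.matcher(content);
        while (m.find()) {
            try {
                LocalDate candidate = LocalDate.parse(m.group(1));
                if (newest == null || candidate.isAfter(newest)) {
                    newest = candidate;
                }
            } catch (DateTimeParseException e) {
                // Looks like a date but is not a valid one (e.g. 2007-13-45); skip it.
            }
        }
        return Optional.ofNullable(newest);
    }

    // Convert to epoch milliseconds (midnight UTC), the unit a
    // modified-time field would typically hold.
    static long toEpochMillis(LocalDate date) {
        return date.atStartOfDay(ZoneOffset.UTC).toInstant().toEpochMilli();
    }
}
```

If no date is found, newestDate returns an empty Optional, and the modified
time should probably be left untouched rather than set to a guess.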
I am also new to Nutch; if anything is wrong, please feel free to point it out.


----- Original Message ----- 
From: "Armel T. Nene" <[EMAIL PROTECTED]>
To: <nutch-dev@lucene.apache.org>
Sent: Thursday, January 25, 2007 7:52 PM
Subject: Modified date in crawldb


> Hi guys,
> 
> 
> 
> I am using Nutch 0.8.2-dev. I have noticed that the crawldb does not
> actually save the last modified date of files. I have run a crawl on my
> local file system and on the web. When I dumped the content of the crawldb
> for both crawls, the modified date of the files was set to
> 01-Jan-1970 01:00:00. I don't know if it's intended to be as is or if it's
> a bug. Therefore my question is:
> 
> 
> 
> *         How does the generator know which files to crawl again?
> 
> o        Is it looking at the fetch time?
> 
> o        Or at the modified date, although this can be misleading?
> 
> 
> 
> A modified date is returned in most HTTP headers, and files on a file
> system all have a last modified date. How come it's not stored in the
> crawldb?
> 
> 
> 
> Here is an extract from my 2 crawls:
> 
> 
> 
> http://dmoz.org/Arts/   Version: 4
> 
> Status: 2 (DB_fetched)
> 
> Fetch time: Thu Feb 22 12:45:43 GMT 2007
> 
> Modified time: Thu Jan 01 01:00:00 GMT 1970
> 
> Retries since fetch: 0
> 
> Retry interval: 30.0 days
> 
> Score: 0.013471641
> 
> Signature: fe52a0bcb1071070689d0f661c168648
> 
> Metadata: null
> 
> 
> 
> file:/C:/TeamBinder/AddressBook/GLOBAL/GLOBAL_fAdrBook_00000121.xml   Version: 4
> 
> Status: 2 (DB_fetched)
> 
> Fetch time: Sat Feb 24 10:31:44 GMT 2007
> 
> Modified time: Thu Jan 01 01:00:00 GMT 1970
> 
> Retries since fetch: 0
> 
> Retry interval: 30.0 days
> 
> Score: 1.1035091E-4
> 
> Signature: 57254d9ca2988ce1bf7f92b6239d6ebc
> 
> Metadata: null
> 
> 
> 
> Looking forward to your reply.
> 
> 
> 
> Regards,
> 
> 
> 
> Armel
> 
> 
> 
> -------------------------------------------------
> 
> Armel T. Nene
> 
> iDNA Solutions
> 
> Tel: +44 (207) 257 6124
> 
> Mobile: +44 (788) 695 0483 
> 
> http://blog.idna-solutions.com
> 
> 
> 
>


