The information should be there. Have a look at the index-more plugin; this will read and parse the metadata. I am coding a modified index-basic parser to include this field and other metadata like the keywords and description fields.

You can get the date by including the routines normalizeMeta and addMetaField from index-more in index-basic.
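
For anyone who does not have the plugin source to hand, here is a rough sketch of what a key-normalizing routine of this kind might look like. It is not the actual Nutch code: the class name NormalizeMetaSketch is made up, and I am assuming the headers are held in a java.util.Properties object, which is only suggested by the setProperty call in the snippet I quote in this mail.

```java
import java.util.Locale;
import java.util.Properties;

// Hypothetical stand-in for the plugin routine; not the actual Nutch source.
class NormalizeMetaSketch {

    // Return a copy of the headers with every key lower-cased.
    // Values are copied unchanged.
    static Properties normalizeMeta(Properties old) {
        Properties normalized = new Properties();
        for (String key : old.stringPropertyNames()) {
            // convert key (but not value) to lower-case
            normalized.setProperty(key.toLowerCase(Locale.ROOT), old.getProperty(key));
        }
        return normalized;
    }

    public static void main(String[] args) {
        Properties headers = new Properties();
        headers.setProperty("Date", "Tue, 13 Dec 2005 05:55:59 GMT");
        headers.setProperty("Content-Type", "text/html");

        Properties normalized = normalizeMeta(headers);
        // Lookups now work with lower-case keys regardless of how the server cased them.
        System.out.println(normalized.getProperty("date"));
        System.out.println(normalized.getProperty("content-type"));
    }
}
```

The point of lower-casing the keys is that a later lookup such as getProperty("date") does not depend on how the server happened to capitalize the header name.
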
To verify this, I added the following log statement to the normalizeMeta routine:

     // convert key (but not value) to lower-case
     normalized.setProperty(key.toLowerCase(), value);

     LOG.info("normalize Meta key:" + key.toLowerCase() + " value: " + value);

I then ran a crawl, which generated output like:

051213 165604 normalize Meta key:server value: Apache/1.3.31 (Unix) mod_perl/1.29 PHP/4.3.9
051213 165604 normalize Meta key:date value: Tue, 13 Dec 2005 05:55:59 GMT
051213 165604 normalize Meta key:content-type value: text/html
051213 165604 normalize Meta key:connection value: close
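
Once the value is available as a string, it still has to be turned into a real date before it can be indexed or compared. The following is only a sketch of that step, not part of Nutch: the class HeaderDateSketch and the method parseHeaderDate are names I made up, and it assumes a Java version that ships java.time (Java 8 or later). It parses the format shown in the "date" line above and returns null for anything it cannot parse.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Hypothetical helper; the names here are illustrative and not from Nutch.
class HeaderDateSketch {

    // Parse a header value such as "Tue, 13 Dec 2005 05:55:59 GMT".
    // Returns null when the value is missing or not in that format.
    static Instant parseHeaderDate(String value) {
        if (value == null) {
            return null;
        }
        try {
            return ZonedDateTime
                    .parse(value.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Value copied from the crawl log output.
        Instant parsed = parseHeaderDate("Tue, 13 Dec 2005 05:55:59 GMT");
        System.out.println(parsed); // prints 2005-12-13T05:55:59Z
    }
}
```

Returning null rather than throwing keeps a single malformed header from aborting the indexing of a page; whether that is the right trade-off depends on how the field is used afterwards.
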

Cheers

John Reidy.

K.A.Hussain Ali wrote:

Hi all,

I am trying to get the modified date of the crawled pages from the meta information
of the page, but I get only null values.
Does Nutch use the meta information of the pages, or is there any other way to get the
last-modified date of the crawled pages?

Any help is greatly appreciated.
Thanks in advance.

regards
-Hussain


