The information should be there. Have a look at the index-more plugin;
it reads and parses the metadata.
I am coding a modified version of index-basic to include this field and
other metadata such as the keywords and description fields.
You can get the date by copying the routines normalizeMeta and
addMetaField from index-more into index-basic.
To verify this I added the following log statement:
// convert key (but, not value) to lower-case
normalized.setProperty(key.toLowerCase(),value);
LOG.info("normalize Meta key:" + key.toLowerCase() + " value: " + value);
in the normalizeMeta routine.
I then ran a crawl, which generated output like:
051213 165604 normalize Meta key:server value: Apache/1.3.31 (Unix) mod_perl/1.29 PHP/4.3.9
051213 165604 normalize Meta key:date value: Tue, 13 Dec 2005 05:55:59 GMT
051213 165604 normalize Meta key:content-type value: text/html
051213 165604 normalize Meta key:connection value: close
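
The date value in that output is a standard HTTP date string, so once the
keys are lower-cased it can be looked up under "date" and parsed. Below is a
minimal standalone sketch of that idea, not the plugin's actual code: the
class MetaDateSketch and the method parseHttpDate are hypothetical names, the
normalizeMeta here only imitates the key lower-casing shown above, and it
uses java.time, which is an assumption about the Java version available.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Properties;

// Hypothetical illustration class; not part of Nutch.
class MetaDateSketch {

    // Copy the headers into a Properties object, converting each key
    // (but not the value) to lower case, as in the log statement above.
    static Properties normalizeMeta(Map<String, String> headers) {
        Properties normalized = new Properties();
        for (Map.Entry<String, String> entry : headers.entrySet()) {
            normalized.setProperty(entry.getKey().toLowerCase(Locale.ROOT), entry.getValue());
        }
        return normalized;
    }

    // Look up a header and parse it as an RFC 1123 HTTP date.
    // Returns null when the header is missing or not parseable.
    static Instant parseHttpDate(Properties meta, String key) {
        String value = meta.getProperty(key);
        if (value == null) {
            return null;
        }
        try {
            return ZonedDateTime
                    .parse(value.trim(), DateTimeFormatter.RFC_1123_DATE_TIME)
                    .toInstant();
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Map<String, String> headers = new HashMap<>();
        headers.put("Date", "Tue, 13 Dec 2005 05:55:59 GMT");
        headers.put("Content-Type", "text/html");

        Properties meta = normalizeMeta(headers);
        System.out.println("date header:   " + meta.getProperty("date"));
        System.out.println("parsed:        " + parseHttpDate(meta, "date"));
        // No Last-Modified header was sent, so this lookup yields null.
        System.out.println("last-modified: " + parseHttpDate(meta, "last-modified"));
    }
}
```

Note that a server which sends no Last-Modified header will still give a null
for that key, so it may be worth falling back to the date header in that case.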
Cheers
John Reidy.
K.A.Hussain Ali wrote:
Hi all,
I am trying to get the modified date of the crawled pages from the meta information
of the page,
but I get only null values.
Does Nutch use the meta information of the pages, or is there another way to get the
last-modified date
of the crawled pages?
Any help is greatly appreciated.
Thanks in advance.
regards
-Hussain
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general