Hi,
that might look strange, but it's not a bug.
It could be improved (see below), simply because
it's not obvious. I also stumbled over this
point some time ago. It also pops up from time
to time on the mailing lists, see the references below.
- when indexing the modified time (sent by the server),
the index-more plugin uses the time from the Content class:
content.getMetadata().get(Response.LAST_MODIFIED)
- the "modified time" stored in the CrawlDb is not the
modified time sent by the server but the time of the
last "real" fetch, excluding fetches which returned
an unmodified document, either by if-modified-since
HTTP requests or by a signature comparison.
See also NUTCH-933.
- it is set by setFetchSchedule(...), but only by
AdaptiveFetchSchedule, not by DefaultFetchSchedule.
The latter does not use the modified time, while the former
"adapts" the re-fetch interval depending on the change frequency.
- the lastModified field in ProtocolStatus shown by toString()
_pst_: success(1), lastModified=0
was apparently never used. It's probably just a relic.
If you remove it, CrawlDbs become incompatible. But it
could be filled with the modified time returned by the
server (or, e.g., the file system for protocol-file).
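For illustration, the "Wed Dec 31 16:00:00 PST 1969" in the dump is simply
an unset modified time (0 ms since the epoch) rendered in Pacific time.
A minimal sketch using only java.time; the class and method names are
made up for this mail and are not part of Nutch. It also shows how a
Last-Modified header value maps to epoch milliseconds:

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper for illustration only, not Nutch code.
class ModifiedTimeDemo {

  // Same layout as java.util.Date.toString(), which the dump appears to use.
  static final DateTimeFormatter DATE_STYLE =
      DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss zzz yyyy", Locale.US);

  // Render epoch milliseconds in the given time zone.
  static String render(long epochMillis, String zone) {
    return DATE_STYLE.format(
        Instant.ofEpochMilli(epochMillis).atZone(ZoneId.of(zone)));
  }

  // Convert an HTTP Last-Modified value (RFC 1123 date) to epoch milliseconds.
  static long parseLastModified(String headerValue) {
    return ZonedDateTime.parse(headerValue, DateTimeFormatter.RFC_1123_DATE_TIME)
        .toInstant()
        .toEpochMilli();
  }

  public static void main(String[] args) {
    // An unset modified time of 0, seen from the US west coast.
    System.out.println(render(0L, "America/Los_Angeles"));
    // A header value as a server might send it (example value).
    System.out.println(parseLastModified("Thu, 05 Nov 2015 10:00:00 GMT"));
  }
}
```

So a modified time of 0 does not mean the server sent a wrong date, only
that nothing was ever stored in that field.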
As said, these could be improvements:
1. also set the modified time in DefaultFetchSchedule
2. set ProtocolStatus.lastModified if the modified time is available
Please feel free to open Jira issues for these.
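To make improvement 1 a bit more concrete, here is a toy model of the
difference. It is deliberately NOT Nutch code: ToyDatum and ToySchedules
are hypothetical stand-ins, and the 30-day interval is just an assumed
default. Which value the caller passes as modified time (server header
or time of the last "real" fetch) is left open here.

```java
// Toy model for illustration only; none of these classes exist in Nutch.
class ToyDatum {
  long fetchTime;                        // next scheduled fetch, ms since epoch
  long modifiedTime;                     // 0 means "never set"
  int fetchIntervalSec = 30 * 24 * 3600; // assumed default re-fetch interval
}

class ToySchedules {

  // Behaviour as described above: only the next fetch time is updated,
  // the modified time passed in is ignored.
  static ToyDatum defaultSchedule(ToyDatum d, long fetchTime, long modifiedTime) {
    d.fetchTime = fetchTime + d.fetchIntervalSec * 1000L;
    return d;
  }

  // Sketch of improvement 1: additionally store the modified time.
  static ToyDatum defaultScheduleWithModifiedTime(
      ToyDatum d, long fetchTime, long modifiedTime) {
    defaultSchedule(d, fetchTime, modifiedTime);
    d.modifiedTime = modifiedTime;
    return d;
  }

  public static void main(String[] args) {
    long now = System.currentTimeMillis();
    ToyDatum before = defaultSchedule(new ToyDatum(), now, now - 60_000L);
    ToyDatum after = defaultScheduleWithModifiedTime(new ToyDatum(), now, now - 60_000L);
    System.out.println("current behaviour: modifiedTime=" + before.modifiedTime);
    System.out.println("with improvement:  modifiedTime=" + after.modifiedTime);
  }
}
```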
Thanks,
Sebastian
References:
https://issues.apache.org/jira/browse/NUTCH-933
http://lucene.472066.n3.nabble.com/setting-modifiedTime-in-DefaultFetchSchedule-td4020457.html
http://www.mail-archive.com/[email protected]/msg15056.html
On 11/06/2015 01:18 AM, Thamme Gowda N. wrote:
> Hello,
>
> I found a strange issue with 'Modified time' in nutch crawldb.
>
> I dumped the crawldb using the command
> / nutch readdb xx -dump yy/
>
> And inspected the 'Modified time' in the dumped content.
>
> Surprisingly, the 'Modified time' is invalid. All the pages have 'Modified
> time: Wed Dec 31 16:00:00
> PST 1969' (That is 0 - 8Hrs). It is worth noting that 'lastModified=0' in
> Metadata.
>
>
> But, I see actual value in the response header.
>
>
>
> I am using Nutch 1.11, can you verify whether this functionality is broken?
>
> --
> Regards,
> Thamme Gowda N