Doug Cutting wrote:
> Andrzej Bialecki wrote:
>> So, I would propose a deadline of Aug 8 for the last commits, and then
>> perhaps Aug 15 for the release?
>
> Sounds good to me. Thanks for helping with this!

Unfortunately, the patches for detecting unmodified content will have
to wait until after the release.
Here's the problem: It's quite easy to add this checking and recording
capability to all fetcher plugins, fetchlist generation and db update
tools, and I've done this in my local patches. However, after a while I
discovered a serious problem in the way Nutch currently manages "phasing
out" of old segment data. If we assume that we always refresh after some
fixed interval (30 days, or whatever), then we can safely delete
segments older than 30 days. If the interval varies, then we could
potentially be stuck with some segments holding very old (but still
valid) data. This is very inefficient, because after a while any given
segment might contain only a couple of such pages, and the rest of them
would have to be removed again and again by deduplication because newer
versions would exist in newer segments.
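
To make the phasing-out problem concrete, here is a minimal sketch in
Java. It is only an illustration under simplifying assumptions: the
class SegmentExpirySketch, the PageRecord type and its fields are
hypothetical and are not existing Nutch classes, and a page is treated
as "still valid" simply while its own refetch interval has not elapsed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model for illustration only; these are not Nutch classes.
class SegmentExpirySketch {

    /** One page in a segment, with its own refetch interval (in days). */
    static final class PageRecord {
        final String url;
        final int fetchDay;      // day on which the page was fetched
        final int intervalDays;  // per-page refetch interval (assumed)

        PageRecord(String url, int fetchDay, int intervalDays) {
            this.url = url;
            this.fetchDay = fetchDay;
            this.intervalDays = intervalDays;
        }

        /** True while the page is not yet due for refetching. */
        boolean stillValid(int today) {
            return fetchDay + intervalDays > today;
        }
    }

    /** Pages that keep the segment alive on the given day. */
    static List<PageRecord> pinningPages(List<PageRecord> segment, int today) {
        List<PageRecord> pinned = new ArrayList<>();
        for (PageRecord p : segment) {
            if (p.stillValid(today)) {
                pinned.add(p);
            }
        }
        return pinned;
    }

    /** A segment is safe to delete only when no page in it is still valid. */
    static boolean canDelete(List<PageRecord> segment, int today) {
        return pinningPages(segment, today).isEmpty();
    }
}
```

With a fixed 30-day interval every page in a segment expires on the same
day, so canDelete() flips to true for the whole segment at once. As soon
as one page has, say, a 90-day interval, that single page pins the whole
segment for another 60 days.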
Moreover (and this is the worst problem), if such segments are lost, the
information in the webdb must be updated so as to force refetching, even
though the "If-Modified-Since" check or the MD5 indicates that the page
is unchanged since the last fetch. Currently the only way to do this is
to "add days", but with a variable refetch interval that doesn't make
much sense. I think we need a better way to track which pages are
"missing" from the segments and have to be re-fetched, or a better DB
update mechanism for the case when we lose some segments.
Perhaps we should extend the Page to record which segment holds the
latest version of the page? But segments don't have unique IDs now (a
directory name is too fragile and too easily changed) ...
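
A rough sketch of what I have in mind, in Java. Everything here is an
assumption for illustration: PageEntry is a stand-in for an extended
Page, the segmentId field does not exist today, and the sketch presumes
that each segment carries some stable ID that is independent of its
directory name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; PageEntry and segmentId are not part of Nutch.
class MissingSegmentSketch {

    /** Stand-in for a Page extended with the ID of its latest segment. */
    static final class PageEntry {
        final String url;
        final String segmentId; // assumed stable, unique segment ID

        PageEntry(String url, String segmentId) {
            this.url = url;
            this.segmentId = segmentId;
        }
    }

    /**
     * Returns the URLs whose latest version lived in a segment that is
     * no longer available, i.e. the pages that must be re-fetched even
     * if the server would report them as unmodified.
     */
    static List<String> urlsToRefetch(List<PageEntry> db,
                                      Set<String> liveSegmentIds) {
        List<String> missing = new ArrayList<>();
        for (PageEntry p : db) {
            if (!liveSegmentIds.contains(p.segmentId)) {
                missing.add(p.url);
            }
        }
        return missing;
    }
}
```

The point of the design is that losing a segment would then affect only
the pages recorded as living in it, instead of requiring a blanket "add
days" over the whole DB.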
Related question: in the FetchListEntry we have a "fetch" flag. I think
that after minor modifications to the FetchListTool (so that it
generates only the entries we are supposed to fetch) we could get rid of
this flag, or change its semantics to mean "unconditionally fetch, even
if unmodified".
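
The changed semantics could look roughly like the following Java
sketch. It is not the current FetchListEntry code: the forceFetch
parameter, the lastFetchTime parameter and the Action enum are names
invented here for illustration.

```java
// Hypothetical sketch of the proposed flag semantics; not Nutch code.
class FetchFlagSketch {

    /** How a fetcher plugin would request the page. */
    enum Action { CONDITIONAL_GET, UNCONDITIONAL_GET }

    /**
     * Decides how to fetch an entry.
     *
     * @param forceFetch    the re-purposed flag: fetch even if unmodified
     * @param lastFetchTime time of the previous fetch in milliseconds,
     *                      or 0 if the page was never fetched (assumed)
     */
    static Action decide(boolean forceFetch, long lastFetchTime) {
        if (forceFetch || lastFetchTime <= 0L) {
            // Nothing to compare against, or the caller insists on
            // fresh content (e.g. the segment with the page was lost).
            return Action.UNCONDITIONAL_GET;
        }
        // Otherwise let the server answer "not modified" if it can.
        return Action.CONDITIONAL_GET;
    }
}
```

Under this reading, every entry in the fetchlist gets fetched, and the
flag only controls whether an "unmodified" answer is acceptable.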
Any comments?
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com