Andrzej Bialecki wrote:
Doug Cutting wrote:
Modify CrawlDatum to store the MD5Hash of the content of fetched urls.
Yes, this is required to detect unmodified content. A small note: a plain
MD5Hash(byte[] content) is quite ineffective for many pages, e.g. pages
with a counter or with ads, where the raw bytes change on every fetch even
though the page is essentially unmodified. It would be good to provide a
framework for other implementations of "page equality" - for now perhaps
we should just say that this value is a byte[], and not specifically an
MD5Hash.
That's reasonable. But we should also make it clear that the larger this
value gets, the slower crawldb updates will be.
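
To make the "page equality" hook concrete, here is a rough sketch; the
interface and class names are illustrative, not an existing Nutch extension
point. Implementations return an opaque byte[], and updatedb only compares
these values to decide whether the content changed:

// Hypothetical pluggable fingerprint of fetched content; names are
// illustrative, not part of the current Nutch API.
public interface Signature {
  /** Compute a fingerprint of the fetched content; shorter values keep the crawldb small. */
  byte[] calculate(byte[] content, String contentType);
}

// The trivial implementation hashes the raw bytes, so a page that differs
// only by a counter or an ad would still be reported as changed.
class MD5Signature implements Signature {
  public byte[] calculate(byte[] content, String contentType) {
    try {
      return java.security.MessageDigest.getInstance("MD5").digest(content);
    } catch (java.security.NoSuchAlgorithmException e) {
      throw new RuntimeException(e); // MD5 is part of every JDK
    }
  }
}

A smarter implementation could fingerprint only the extracted text, or a
profile of it, so that counters and rotating ads do not defeat the
unmodified-content check.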
Other additions to CrawlDatum for consideration:
* last modified time, not just the last fetched time - these two are
different, and the fetching policy will depend on both. E.g., to
synchronize with a page's change cycle it is necessary to know the time of
the last modification that Nutch observed. I've done simulations which show
that if we don't track this value, the fetchInterval adjustments won't
stabilize even when the page change cycle is fixed.
Instead of a long, these could be two unsigned ints, seconds since
epoch. That would be good for another 100 years.
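
A minimal sketch of that representation (field and accessor names are
illustrative, not the actual CrawlDatum layout); an unsigned 32-bit seconds
value overflows only around the year 2106:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Illustrative holder for the two timestamps, stored as unsigned 32-bit
// seconds since the epoch instead of two 64-bit millisecond longs.
public class FetchTimes {
  private int fetchTime;     // seconds since epoch, treated as unsigned
  private int modifiedTime;  // seconds since epoch, treated as unsigned

  public void setFetchTime(long millis)    { fetchTime = (int) (millis / 1000L); }
  public void setModifiedTime(long millis) { modifiedTime = (int) (millis / 1000L); }

  // Mask to keep values past 2038 positive when converting back to millis.
  public long getFetchTime()    { return (fetchTime & 0xFFFFFFFFL) * 1000L; }
  public long getModifiedTime() { return (modifiedTime & 0xFFFFFFFFL) * 1000L; }

  public void write(DataOutput out) throws IOException {
    out.writeInt(fetchTime);
    out.writeInt(modifiedTime);
  }

  public void readFields(DataInput in) throws IOException {
    fetchTime = in.readInt();
    modifiedTime = in.readInt();
  }
}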
* segment name from the last updatedb. I'm not fully convinced about
this, but consider the following:
I think this is needed in order to check which segments may be safely
deleted because they no longer contain any active pages (a sketch of that
check follows below). If you enable a variable fetchInterval, then after a
while you will end up with widely ranging intervals - some pages will have
a daily or hourly period, others a period of several months. Add to this
the fact that the clock starts at a different moment for each page, and
the oldest page copy you still rely on could be as old as maxFetchInterval
(whatever that is - Float.MAX_VALUE or some other maximum you set). Most
likely such old pages would live in segments that otherwise contain very
little current data.
This is only possible if maxFetchInterval is very long and one is doing
frequent large updates. That to me is misuse. If maxFetchInterval is,
e.g., 30 days and one updates 1% of the collection every day, then the
wasted space would be small, and segments may be discarded after
maxFetchInterval. Is that impractical?
I also think we could go crazy trying to make things work indefinitely
with incremental crawling. Rather I think one should periodically
re-crawl from scratch, using the existing crawldb to bootstrap. This
way dead links are eventually retried, and urls that are no longer
referred to can be GC'd.
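
For reference, the check that a stored segment name would enable reduces to
something like this (an illustrative sketch only, not existing code):
collect the segment names still referenced by crawl db entries, and
anything not in that set can be deleted.

import java.util.HashSet;
import java.util.Set;

// Illustrative sketch: a segment is deletable once no crawl db entry
// references it as the source of its current page copy.
class SegmentGc {
  static Set<String> deletableSegments(Iterable<String> segmentNamesInCrawlDb,
                                       Iterable<String> candidateSegments) {
    Set<String> live = new HashSet<String>();
    for (String name : segmentNamesInCrawlDb) {
      live.add(name);
    }
    Set<String> deletable = new HashSet<String>();
    for (String candidate : candidateSegments) {
      if (!live.contains(candidate)) {
        deletable.add(candidate);
      }
    }
    return deletable;
  }
}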
Alternatively, we could add Properties to CrawlDatum, and let people put
whatever they wish there...
This is probably a good idea, although it makes CrawlDatum bigger and
algorithms slower, since the strings must be parsed. So I'd argue that the
default mechanism should not rely on properties. Is that a premature
optimization?
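
For illustration, a free-form properties map serialized with each datum
could look roughly like this (all names are illustrative); the per-entry
string handling below is exactly where the parsing cost mentioned above
comes from:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of opaque key/value properties on a crawl datum.
public class DatumProperties {
  private final Map<String, String> props = new HashMap<String, String>();

  public void put(String key, String value) { props.put(key, value); }
  public String get(String key)             { return props.get(key); }

  public void write(DataOutput out) throws IOException {
    out.writeInt(props.size());
    for (Map.Entry<String, String> e : props.entrySet()) {
      out.writeUTF(e.getKey());
      out.writeUTF(e.getValue());
    }
  }

  public void readFields(DataInput in) throws IOException {
    props.clear();
    int n = in.readInt();
    for (int i = 0; i < n; i++) {
      props.put(in.readUTF(), in.readUTF());
    }
  }
}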
In the original patchset I had a notion of pluggable FetchSchedule-s. I
think this would be an ideal place to make such decisions.
Implementations would be pluggable in a similar way to URLFilter, with
the DefaultFetchSchedule doing what we do today.
+1
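
As a sketch only (the signatures are illustrative, not the actual extension
point from the patchset), the pluggable schedule could be as small as:

// Illustrative FetchSchedule contract: given what we know about a page,
// decide when to fetch it next.
public interface FetchSchedule {
  /** Return the time (ms since epoch) at which the page should next be fetched. */
  long calculateNextFetchTime(String url, long prevFetchTime, long prevModifiedTime,
                              long fetchTime, long modifiedTime, boolean changed,
                              long defaultIntervalMs);
}

// Same behaviour as today: refetch after a fixed interval, ignoring history.
class DefaultFetchSchedule implements FetchSchedule {
  public long calculateNextFetchTime(String url, long prevFetchTime, long prevModifiedTime,
                                     long fetchTime, long modifiedTime, boolean changed,
                                     long defaultIntervalMs) {
    return fetchTime + defaultIntervalMs;
  }
}

An adaptive implementation would shrink the interval when the content
changed since the last fetch and grow it otherwise, which is where the
stored last-modified time becomes essential.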
4. Update the crawl db & link db, index the new segment, dedup, etc.
When updating the crawl db, scores for existing urls should not
change, since the scoring method we're using (OPIC) assumes each page
is fetched only once.
I would love to refactor this part too, abstracting the scoring mechanism
in a similar way so that you could plug in different scoring
implementations. The float value in CrawlDatum is opaque enough to support
different scoring mechanisms.
+1
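
As a very rough sketch (names illustrative; the defaults below are only a
caricature of OPIC-like behaviour), the hook could look like:

// Illustrative scoring plug-in: the float stored in CrawlDatum stays opaque
// to everything except the plug-in that interprets it.
public interface ScoringPlugin {
  /** Initial score for a newly injected or newly discovered url. */
  float initialScore(String url);

  /** Contribution passed from a fetched page to each of its outlinks. */
  float outlinkContribution(float pageScore, int outlinkCount);

  /** Score to keep in the crawl db when a page is re-fetched. */
  float updateScore(float oldScore, float refetchedScore);
}

// OPIC-like defaults: existing scores are left alone on re-fetch, since the
// method assumes each page is fetched only once.
class OpicLikeScoring implements ScoringPlugin {
  public float initialScore(String url) { return 1.0f; }
  public float outlinkContribution(float pageScore, int outlinkCount) {
    return outlinkCount > 0 ? pageScore / outlinkCount : 0.0f;
  }
  public float updateScore(float oldScore, float refetchedScore) {
    return oldScore;
  }
}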
Doug