Thanks Ken.
On 1/12/2010 5:17 PM, Ken Krugler wrote:
Hi Richard,
I'm not a Droids expert, but I can answer some of the more general
questions below.
On Jan 12, 2010, at 2:44pm, Richard Frovarp wrote:
2) How best to do an incremental crawl? I'm going to want to do
If-Modified-Since checks as I crawl.
Most crawlers, including Nutch, don't rely on any last modified data
returned by servers. Why? Well, it's wrong much of the time. So you
wind up having to fetch the content and (ideally) generate/compare a
"signature hash" that tries to ignore meaningless differences such as
a "number of visitors" counter.
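A minimal sketch of the signature-hash idea described above, in Python. This is not how Nutch or Droids computes signatures; the function name content_signature and the "visitors" pattern are hypothetical, and a real crawl would need patterns tuned to each site.

```python
import hashlib
import re

# Hypothetical patterns for page fragments that change on every fetch
# without the real content changing (here: a "1,234 visitors" counter).
VOLATILE_PATTERNS = [
    re.compile(r"\d[\d,]*\s+visitors", re.IGNORECASE),
]


def content_signature(text: str) -> str:
    """Hash the text after dropping volatile fragments and normalizing
    whitespace and case, so meaningless differences do not change it."""
    for pattern in VOLATILE_PATTERNS:
        text = pattern.sub("", text)
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()
```

Comparing the signature stored from the previous crawl with the one computed from the fresh fetch tells you whether the page needs reprocessing, at the cost of still having to fetch it.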
Good point. For the most part, our systems are decent enough to check
the modified time against. My biggest concern is having to reprocess
all of the PDFs and docs. Those are mostly served by HTTPD straight
from the filesystem and should have a valid last-modified time.
Furthermore, they do have a stable signature, so your suggestion would
work as well. Is the ETag reliable when presented?
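
For reference, the kind of conditional check I have in mind could be sketched like this in Python, using only the standard library. The helper names build_conditional_request and fetch_if_changed are hypothetical, not part of Droids, and this says nothing about whether a given server's ETag or Last-Modified values can be trusted; it only shows the mechanics of sending the stored validators back and treating a 304 as "unchanged".

```python
import urllib.error
import urllib.request


def build_conditional_request(url, etag=None, last_modified=None):
    """Build a GET request carrying whichever validators were stored
    from the previous crawl of this URL."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return urllib.request.Request(url, headers=headers)


def fetch_if_changed(url, etag=None, last_modified=None):
    """Return (body, etag, last_modified). body is None when the server
    answers 304 Not Modified, so the document can skip reprocessing."""
    request = build_conditional_request(url, etag, last_modified)
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("ETag"),
                response.headers.get("Last-Modified"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; anything else is a real failure.
        if err.code == 304:
            return None, etag, last_modified
        raise
```

The validators returned on a full fetch would be stored alongside the document and passed back in on the next crawl.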