Hi Richard,

I'm not a Droids expert, but I can answer some of the more general questions below.

On Jan 12, 2010, at 2:44pm, Richard Frovarp wrote:


2) How best to do an incremental crawl? I'm going to want to do if-last-modified checks as I crawl.

Most crawlers, including Nutch, don't rely on any last-modified data returned by servers. Why? Well, it's wrong much of the time. So you wind up having to fetch the content and (ideally) generate/compare a "signature hash" that tries to ignore meaningless differences such as a "number of visitors" counter.
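To make the "signature hash" idea concrete, here's a minimal sketch. This is not how Nutch actually computes its signature; the normalization rule (strip digits, collapse whitespace) is an assumption for illustration, chosen so a changing visitor counter doesn't alter the hash:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class PageSignature {

    // Hypothetical normalization: drop digits and collapse whitespace so that
    // counters ("1,234 visitors") and minor reflows don't change the signature.
    static String normalize(String html) {
        return html.replaceAll("[0-9]", "").replaceAll("\\s+", " ").trim();
    }

    static String signature(String html) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(normalize(html).getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String v1 = "<p>Welcome!</p> <p>1,234 visitors</p>";
        String v2 = "<p>Welcome!</p>  <p>1,235 visitors</p>";
        // Same signature despite the counter (and whitespace) changing:
        System.out.println(signature(v1).equals(signature(v2)));
    }
}
```

If the two signatures match, you skip re-indexing even though the raw bytes differ.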

Good point. For the most part, we have fairly decent systems to check the modified time against. My biggest concern is having to reprocess all of the PDFs and docs. Those are mostly served by HTTPD straight from the filesystem and should have a valid last-modified. Furthermore, they do have a stable signature, so your suggestion would work as well. Is the ETag reliable when present?

From what I've seen, yes - in that it's less likely to be bogus if present, but it's still possible for the back-end system to generate a false-positive identical ETag.

You could use ETags plus the Content-Length from the response headers to further improve the reliability of the "no need to download" result.
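A sketch of that combined check (the `CrawlRecord` class and `needsDownload` helper are hypothetical names, not Droids API): skip the download only when both the stored ETag and Content-Length match the new response headers, since either value alone can lie:

```java
public class RecrawlCheck {

    // Hypothetical record of the headers we saw on the previous crawl.
    static class CrawlRecord {
        final String etag;
        final long contentLength;
        CrawlRecord(String etag, long contentLength) {
            this.etag = etag;
            this.contentLength = contentLength;
        }
    }

    // Skip the download only when BOTH ETag and Content-Length match
    // what we stored; a missing ETag means we have to fetch.
    static boolean needsDownload(CrawlRecord prev, String etag, long contentLength) {
        if (prev == null || prev.etag == null || etag == null) {
            return true;
        }
        return !(prev.etag.equals(etag) && prev.contentLength == contentLength);
    }

    public static void main(String[] args) {
        CrawlRecord prev = new CrawlRecord("\"abc123\"", 5120L);
        System.out.println(needsDownload(prev, "\"abc123\"", 5120L)); // false: skip
        System.out.println(needsDownload(prev, "\"abc123\"", 6001L)); // true: length changed
    }
}
```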

If the main concern is processing time (versus download time), then I'd just generate a hash from the content bytes and use that. You'll occasionally reprocess a file that is in fact logically unchanged, in cases where an ETag or Last-Modified value in the response header would have let you skip that step, but that shouldn't be a common case.
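For the processing-time case, a minimal sketch of gating the expensive parse on a hash of the raw content bytes (`ReprocessGate` and `shouldReprocess` are hypothetical names, not part of Droids):

```java
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class ReprocessGate {

    private final Map<String, String> lastHashByUrl = new HashMap<>();

    static String sha256Hex(byte[] content) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Returns true only when the downloaded bytes differ from last time,
    // i.e. the (expensive) PDF/doc parsing step actually needs to run.
    boolean shouldReprocess(String url, byte[] content) throws Exception {
        String hash = sha256Hex(content);
        String prev = lastHashByUrl.put(url, hash);
        return !hash.equals(prev);
    }
}
```

You still pay for the download, but the parse of an unchanged PDF is skipped.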

-- Ken



--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



