Re: Couple of Droids questions

Richard Frovarp Tue, 12 Jan 2010 19:45:57 -0800

Thanks Ken.

On 1/12/2010 5:17 PM, Ken Krugler wrote:

Hi Richard,
I'm not a Droids expert, but I can answer some of the more general questions below.
On Jan 12, 2010, at 2:44pm, Richard Frovarp wrote:
2) How best to do an incremental crawl? I'm going to want to do if-last-modified checks as I crawl.
Most crawlers, including Nutch, don't rely on any last modified data returned by servers. Why? Well, it's wrong much of the time. So you wind up having to fetch the content and (ideally) generate/compare a "signature hash" that tries to ignore meaningless differences such as a "number of visitors" counter.
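
As a rough illustration of the "signature hash" idea, here is a minimal Python sketch. It is not how Nutch or Droids computes signatures; the function name `content_signature` and the single "N visitors" pattern are assumptions made up for this example, and a real crawler would need site-specific rules for what counts as a meaningless difference.

```python
import hashlib
import re

# Hypothetical list of volatile fragments to drop before hashing.
# Only a "123 visitors" style counter is handled here, as an example.
VOLATILE_PATTERNS = [
    re.compile(r"\d+\s+visitors", re.IGNORECASE),
]


def content_signature(text: str) -> str:
    """Return a hash of the text with volatile fragments removed."""
    for pattern in VOLATILE_PATTERNS:
        text = pattern.sub("", text)
    # Collapse runs of whitespace so layout-only changes do not matter.
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


# Two fetches of the same page that differ only in the visitor counter.
first = content_signature("Welcome to the site. 123 visitors")
second = content_signature("Welcome to the site.   98765 visitors")
unchanged = first == second  # the page would be treated as unchanged
```

A crawler would store the signature alongside the URL and reprocess the document only when a later fetch produces a different value.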

Good point. We for the most part have fairly decent systems against which to check the modified time. My biggest concern is having to reprocess all of the PDFs and docs. Those for the most part are being pulled by HTTPD from the filesystem and should have a valid last modified. Furthermore, they do have a stable signature, so your suggestion would work as well. Is the ETag reliable when presented?
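
For reference, the kind of check discussed here can be sketched as an HTTP conditional GET using the standard `If-Modified-Since` and `If-None-Match` request headers. This is a minimal Python sketch with the standard library, not Droids code; the helper names `conditional_headers` and `fetch_if_changed` are hypothetical, and it assumes the crawler has stored the `Last-Modified` and `ETag` values from the previous fetch.

```python
import urllib.error
import urllib.request


def conditional_headers(last_modified=None, etag=None):
    """Build the validator headers from values saved on the last fetch."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers


def fetch_if_changed(url, last_modified=None, etag=None):
    """Return (body, last_modified, etag); body is None on 304 Not Modified."""
    request = urllib.request.Request(
        url, headers=conditional_headers(last_modified, etag)
    )
    try:
        with urllib.request.urlopen(request) as response:
            return (
                response.read(),
                response.headers.get("Last-Modified"),
                response.headers.get("ETag"),
            )
    except urllib.error.HTTPError as err:
        # urllib reports 304 as an HTTPError; treat it as "unchanged".
        if err.code == 304:
            return None, last_modified, etag
        raise
```

When the server answers 304, the stored copy of the PDF or doc can be kept and the reprocessing step skipped. Whether the validators can be trusted still depends on the server, so a signature comparison remains a useful fallback.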
