Re-crawling scenario and HTTP Headers

Mingfai Sat, 04 Apr 2009 05:19:17 -0700

hi,

I think I have a better picture of Droids now, and have learnt things beyond
the Simple Runtime, including the more advanced GaussianRandomDelayTimer and
SimpleTaskQueueWithHistory. It seems to me that SimpleTaskQueue is not useful
for most web crawling scenarios, as pages usually link to each other (so the
same URL turns up repeatedly), and SimpleTaskQueueWithHistory is very useful.


AFAIK, there is no mechanism that caters for the re-crawling scenario. I
wonder if anyone has ideas on:

   - how to determine whether a page/URL has changed?
      - follow the cache and expiry dates in the HTTP headers
      - Size: treat the page as changed if it differs by more than 5-15%
      - Text change detection algorithm, such as Myers' diff algorithm (I
      only know the name :-) and I'm not sure whether it is really meaningful
      to do detection in this way)
      http://code.google.com/p/google-diff-match-patch/

   - at which point in Droids should the detection logic run?
      - We could have a Task Validator that checks the fetch history and maybe
      rejects the task if the expiry time has not passed yet. This is the
      first level of change detection.
      - At parse time, when the content is first accessed, one could
      implement a parser that does change detection.
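
To make the first two ideas concrete, here is a minimal sketch in plain Java.
It is not Droids code: the class name RecrawlCheckSketch and both method names
are hypothetical, and it assumes the fetch time and the Cache-Control and
Expires header values from the previous fetch were kept somewhere. A Task
Validator could call something like isStillFresh(), and a parser could call
something like sizeChanged().

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of the Droids API. */
final class RecrawlCheckSketch {

    // Matches the max-age directive of a Cache-Control header, e.g. "max-age=3600".
    private static final Pattern MAX_AGE =
            Pattern.compile("max-age\\s*=\\s*(\\d{1,10})", Pattern.CASE_INSENSITIVE);

    /**
     * First-level check: returns true if the copy fetched at fetchedAt has not
     * expired yet, so the re-crawl task could be rejected.
     * Cache-Control max-age takes precedence over Expires; either may be null.
     */
    public static boolean isStillFresh(Instant fetchedAt, String cacheControl,
                                       String expires, Instant now) {
        if (cacheControl != null) {
            Matcher m = MAX_AGE.matcher(cacheControl);
            if (m.find()) {
                Instant freshUntil = fetchedAt.plusSeconds(Long.parseLong(m.group(1)));
                return now.isBefore(freshUntil);
            }
        }
        if (expires != null) {
            try {
                // HTTP dates look like "Sat, 04 Apr 2009 13:00:00 GMT".
                Instant expiry = ZonedDateTime
                        .parse(expires, DateTimeFormatter.RFC_1123_DATE_TIME)
                        .toInstant();
                return now.isBefore(expiry);
            } catch (DateTimeParseException e) {
                return false; // unparsable value such as "0": treat as already expired
            }
        }
        return false; // no freshness information: allow the re-crawl
    }

    /**
     * Second-level check: returns true if the size differs from the previous
     * size by more than the given fraction (e.g. 0.05 to 0.15).
     */
    public static boolean sizeChanged(long oldSize, long newSize, double tolerance) {
        if (oldSize == 0) {
            return newSize != 0;
        }
        double relative = Math.abs(newSize - oldSize) / (double) oldSize;
        return relative > tolerance;
    }
}
```

The size check is only a cheap heuristic: a page can change without its size
moving outside the tolerance, so a content hash or a diff would still be
needed for a definite answer.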

In both of the above cases, there is a problem: the ContentEntity doesn't
contain the full set of HTTP headers (at least, not the HTTP headers that
are relevant to change detection). Should all HTTP headers be stored in the
ContentEntity?
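
As a rough illustration of what storing all headers could look like, here is a
sketch of a small header holder. It is an assumption for discussion only: the
class HeaderStoreSketch and its methods are hypothetical and do not reflect
the actual ContentEntity interface. The one real constraint it encodes is that
HTTP header names are case-insensitive and a header may occur more than once.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical header holder; not the actual Droids ContentEntity. */
final class HeaderStoreSketch {

    // Case-insensitive keys, so "ETag", "etag" and "ETAG" refer to the same header.
    private final Map<String, List<String>> headers =
            new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    /** Records one header line; repeated names keep all values in order. */
    public void add(String name, String value) {
        headers.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    /** Returns the first value for the header, or null if it was not sent. */
    public String first(String name) {
        List<String> values = headers.get(name);
        return values == null ? null : values.get(0);
    }

    /** Returns all values for the header, or an empty list if it was not sent. */
    public List<String> all(String name) {
        List<String> values = headers.get(name);
        return values == null
                ? Collections.<String>emptyList()
                : Collections.unmodifiableList(values);
    }
}
```

With something like this attached to the fetched content, a validator or
parser could read first("Expires"), first("Cache-Control"), first("ETag") or
first("Last-Modified") without the fetcher having to know in advance which
headers matter.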

Regards,
mingfai
