Re: [I] Send host info on a specific stream [stormcrawler]

via GitHub Fri, 03 Jul 2026 02:33:33 -0700


dpol1 commented on issue #867:
URL: https://github.com/apache/stormcrawler/issues/867#issuecomment-4874760934


   > add a small random jitter to avoid that the retries after a day cause a 
spike in errors
   
   Taking this. I had dismissed jitter on the theory that per-host blocks are 
unsynchronized, but that assumption breaks exactly as you describe: the crawl 
collects most of its first 429s early on, the blocks escalate on the same 
schedule, and the capped ones expire together. A `fetcher.backoff.jitter` 
fraction on every computed block is cheap insurance.
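   
   A minimal sketch of what that could look like, assuming exponential doubling up to a cap and a one-sided jitter added after capping. Only the `fetcher.backoff.jitter` key comes from this thread; the class, method and parameter names are hypothetical.

```java
import java.util.Random;

/** Hypothetical helper, not StormCrawler API: capped exponential back-off plus jitter. */
final class BackoffJitter {

    /**
     * @param level      number of previous escalations for the host (0 = first block)
     * @param baseMillis block duration at level 0 (assumed parameter)
     * @param capMillis  upper bound on the un-jittered block, e.g. one day
     * @param jitter     fraction in [0, 1], e.g. the value of fetcher.backoff.jitter
     * @param rnd        source of randomness, injectable for tests
     * @return a duration in [block, block * (1 + jitter))
     */
    static long blockMillis(int level, long baseMillis, long capMillis, double jitter, Random rnd) {
        // Double the base once per level, stopping as soon as the cap is reached.
        long block = baseMillis;
        for (int i = 0; i < level && block < capMillis; i++) {
            block *= 2;
        }
        block = Math.min(block, capMillis);
        // Jitter is applied AFTER capping; otherwise capped hosts would still expire together.
        long extra = (long) (block * jitter * rnd.nextDouble());
        return block + extra;
    }
}
```

   With `jitter = 0.1` and a one-day cap, capped blocks would spread over roughly 2.4 hours instead of expiring at the same instant.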
   
   > Alternatively, keep them in the fetcher queues as long as the delay is 
very short (< 3 mins), then call `blockQueueUntil`. But that makes the 
implementation more complex.
   
   I'd keep the first version simple and always go through the frontier. The 
short-delay case costs one extra round-trip through the status stream, which 
seems acceptable; we can revisit this if it hurts in practice.
   
   > Would be good, if also the list of HTTP status codes triggering the 
back-off is configurable.
   
   Will do: `fetcher.backoff.status.codes`, defaulting to 429 and 503. The 
paper is a useful pointer for what else bot-protection stacks return. I'd leave 
403 to configuration rather than the default, since it so often just means 
"forbidden".
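   
   As a rough illustration of the lookup, assuming the value may arrive either as a list or as a comma-separated string in the topology configuration. The key name is the one proposed above; the class and method are hypothetical and use a plain `Map` rather than any StormCrawler helper.

```java
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical reader for the proposed fetcher.backoff.status.codes setting. */
final class BackoffStatusCodes {

    static final String KEY = "fetcher.backoff.status.codes";

    /** Returns the configured codes, or the proposed default (429, 503) if none are set. */
    static Set<Integer> fromConf(Map<String, Object> conf) {
        Object raw = conf.get(KEY);
        Set<Integer> codes = new LinkedHashSet<>();
        if (raw instanceof Collection) {
            // List form, e.g. [429, 503, 403]
            for (Object o : (Collection<?>) raw) {
                codes.add(Integer.parseInt(o.toString().trim()));
            }
        } else if (raw != null) {
            // Comma-separated string form, e.g. "429, 503, 403"
            for (String s : raw.toString().split(",")) {
                if (!s.trim().isEmpty()) {
                    codes.add(Integer.parseInt(s.trim()));
                }
            }
        }
        if (codes.isEmpty()) {
            codes.add(429);
            codes.add(503);
        }
        return codes;
    }
}
```

   A crawl that wants 403 treated as a back-off signal would then list it explicitly, e.g. `429, 503, 403`.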
   
   > Yes, definitely. A timeout is even worse, because resources are occupied 
until it's hit.
   
   Agreed on including it then. I'd still start with the flag off and flip the 
default once the basic back-off has seen some real crawls: exceptions are a 
noisier signal than a clean 429.
   
   > Let them fail: The URLs were not send as request and have not received a 
HTTP 429 yet, right? And there's no status which encodes a PURGE.
   
   Right, they were only queued, never sent. Letting them fail also keeps the 
counter honest: failures inside one block window escalate at most once, so the 
backlog cannot inflate the delay.
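   
   The "at most once per window" rule could be sketched as below, assuming a per-host record that holds an escalation level and the end of the current block. All names are hypothetical and jitter is left out to keep the sketch short.

```java
/** Hypothetical per-host state: failures inside an open block window do not escalate. */
final class HostBackoffState {

    private int level = 0;          // number of escalations so far
    private long blockedUntil = 0L; // epoch millis at which the current block ends

    /**
     * Records a failure observed at nowMillis.
     *
     * @return true if the failure opened a new (longer) block, false if it fell
     *         inside the current window and was ignored for escalation purposes
     */
    boolean onFailure(long nowMillis, long baseMillis, long capMillis) {
        if (nowMillis < blockedUntil) {
            // Backlog failing while the host is already blocked: no escalation.
            return false;
        }
        // Capped exponential block for the current level.
        long block = baseMillis;
        for (int i = 0; i < level && block < capMillis; i++) {
            block *= 2;
        }
        blockedUntil = nowMillis + Math.min(block, capMillis);
        level++;
        return true;
    }

    int level() {
        return level;
    }

    long blockedUntil() {
        return blockedUntil;
    }
}
```

   However many queued URLs fail during one window, `level` moves by one step only; the next escalation requires a failure after the block has expired.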
   
   One update since my comment above: following the discussion on #1973, the 
signal will ride the generic `queue` stream `(key, metadata)` from the status 
updater rather than a dedicated stream from the fetcher. The consumer can tell 
the two cases apart from the metadata (status code plus presence of the 
header), so the mechanics described here are unchanged.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
