dpol1 commented on issue #867: URL: https://github.com/apache/stormcrawler/issues/867#issuecomment-4874760934
> add a small random jitter to avoid that the retries after a day cause a spike in errors

Taking this. I had dismissed jitter on the theory that per-host blocks are unsynchronized, but that assumption breaks exactly as you describe: the crawl collects most of its first 429s early on, the blocks escalate on the same schedule, and the capped ones expire together. A `fetcher.backoff.jitter` fraction on every computed block is cheap insurance.

> Alternatively, keep them in the fetcher queues as long as the delay is very short (< 3 mins), then call `blockQueueUntil`. But that makes the implementation more complex.

I'd keep the first version simple and always go through the frontier. The short-delay case costs one extra round-trip through the status stream, which seems acceptable; can revisit if it hurts in practice.

> Would be good, if also the list of HTTP status codes triggering the back-off is configurable.

Will do: `fetcher.backoff.status.codes`, defaulting to 429 and 503. The paper is a useful pointer for what else bot-protection stacks return. I'd leave 403 to configuration rather than the default, since it so often just means "forbidden".

> Yes, definitely. A timeout is even worse, because resources are occupied until it's hit.

Agreed on including it then. I'd still start with the flag off and flip the default once the basic back-off has seen some real crawls: exceptions are a noisier signal than a clean 429.

> Let them fail: The URLs were not send as request and have not received a HTTP 429 yet, right? And there's no status which encodes a PURGE.

Right, they were only queued, never sent. Letting them fail also keeps the counter honest: failures inside one block window escalate at most once, so the backlog cannot inflate the delay.

One update since my comment above: following the discussion on #1973, the signal will ride the generic `queue` stream `(key, metadata)` from the status updater rather than a dedicated stream from the fetcher.
The consumer can tell the two cases apart from the metadata (status code plus presence of the header), so the mechanics described here are unchanged.
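
For illustration only, here is a minimal Java sketch of two of the mechanics discussed above: a jitter fraction applied to every computed block, and escalation at most once per block window. It is not the planned implementation. The class `BackoffSketch`, its methods, the doubling schedule, the base and cap values, and the symmetric plus-or-minus form of the jitter are all assumptions made for the example; only the idea of a jitter fraction and the escalate-once rule come from the discussion.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

/** Hypothetical sketch: per-queue exponential back-off with jitter. */
class BackoffSketch {
    private final long baseMs;   // assumed delay for the first block
    private final long capMs;    // assumed upper bound before jitter
    private final double jitter; // fraction, e.g. 0.1 means +/-10%
    private final Random random;

    // Number of escalations seen so far, per queue key.
    private final Map<String, Integer> levels = new HashMap<>();
    // Timestamp (ms) until which each queue is blocked.
    private final Map<String, Long> blockedUntil = new HashMap<>();

    BackoffSketch(long baseMs, long capMs, double jitter, Random random) {
        this.baseMs = baseMs;
        this.capMs = capMs;
        this.jitter = jitter;
        this.random = random;
    }

    /** Doubles the delay per step, caps it, then applies the jitter. */
    long delayMs(int step) {
        int shift = Math.min(step, 30); // keep the shift well inside long range
        long raw = Math.min(capMs, baseMs << shift);
        // Jitter is applied after the cap so that capped blocks
        // do not all expire at the same instant.
        double factor = 1.0 + jitter * (2.0 * random.nextDouble() - 1.0);
        return Math.round(raw * factor);
    }

    /**
     * Records a failure for the queue and returns the time until which
     * it is blocked. Failures that arrive while the queue is still
     * blocked do not escalate, so a backlog cannot inflate the delay.
     */
    long onFailure(String queueKey, long nowMs) {
        Long until = blockedUntil.get(queueKey);
        if (until != null && nowMs < until) {
            return until; // inside the current block window
        }
        int level = levels.merge(queueKey, 1, Integer::sum);
        long newUntil = nowMs + delayMs(level - 1);
        blockedUntil.put(queueKey, newUntil);
        return newUntil;
    }

    /** Current escalation level for the queue (0 if never blocked). */
    int level(String queueKey) {
        return levels.getOrDefault(queueKey, 0);
    }
}
```

With a base of 1000 ms and a jitter of 0.1, the first block in this sketch lasts between 900 and 1100 ms, and any further failures reported before it expires return the same deadline without raising the level.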
