dpol1 commented on issue #867: URL: https://github.com/apache/stormcrawler/issues/867#issuecomment-4863272990
Now that #1973 covers the explicit case (429/503 with a Retry-After header), I'd like to tackle the case we probably see most in the wild: servers that rate-limit without sending any header at all. Today we just fetch the host again and collect another 429.

The idea, kept short: FetcherBolt already emits on the `hostinfo` stream when there's a header. I'd widen that signal with a `kind` field, so a headerless 429/503 becomes a "pressure" event. The consumer bolt keeps a small per-host counter and blocks the queue in the frontier for a growing duration: 60s the first time, then 2 minutes, 4, 8, and so on, capped at a day. If the host stays quiet for half an hour the counter is forgotten and the next incident starts from scratch.

Nothing new under the sun: Nutch has done this since 1.19 ([NUTCH-2946](https://issues.apache.org/jira/browse/NUTCH-2946), same doubling curve, same 30 minute reset). The difference is where the state lives. Nutch keeps it inside the fetcher and later had to patch around fetching going stale when only backed-off queues remain ([NUTCH-3114](https://issues.apache.org/jira/browse/NUTCH-3114)). With the state in the frontier, a blocked host never reaches the fetcher in the first place, so that class of problem can't happen. Workers stay stateless, which is also the reason #1944 was closed.

One subtlety worth calling out. When the first 429 arrives there may be a dozen URLs of that host already sitting in the fetcher's internal queues. They will fail one after the other, and those failures must not each escalate the counter, otherwise a single incident jumps straight to the 24h cap. So: escalate at most once per block window. Same trick TCP uses, where multiple losses inside one RTT count as a single congestion event.

A couple of questions before I write any code:

1. Enforcement: I'd use `blockQueueUntil` with growing durations. Blocks expire on their own, so there is nothing to reset and no way to leave a host throttled forever. `setDelay` would be the gentler alternative (host stays crawlable, just slower), but it needs an explicit reset, and `setDelay(key, 0)` also wipes any server-side default delay for that queue. Does anyone feel strongly about throttling instead of blocking?
2. Should the headerless back-off also count fetch exceptions (timeouts, connection refused), like Nutch does and #1106 asks for? I'd put that behind a config flag, off by default.
3. The 429'd URLs already sitting in the fetcher's queues: purge them to the status stream, or let them fail and reschedule on their own? The same question is open on #1973; whatever we decide there applies here too.

If this sounds reasonable I'll put a PR together once #1973 has landed.
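
To make the escalation rule concrete, here is a minimal sketch of the per-host counter described above: 60s doubling up to a one-day cap, a 30 minute quiet reset, and at most one escalation per block window. `HostBackoff`, `onPressure` and the constants are hypothetical names, not existing StormCrawler code. It also assumes the quiet period is measured from the end of the last block, which the proposal does not pin down, and it leaves out the wiring to the `hostinfo` stream and the frontier.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper (not part of StormCrawler): per-host back-off state. */
final class HostBackoff {
    static final long BASE_MS = 60_000L;                  // first block: 60s
    static final long CAP_MS = 24L * 60 * 60 * 1000;      // longest block: one day
    static final long RESET_MS = 30L * 60 * 1000;         // quiet period before forgetting
    private static final int MAX_SHIFT = 11;              // 60s << 11 already exceeds the cap

    private static final class State {
        int level;          // number of escalations in the current incident
        long blockedUntil;  // end of the current block window, epoch millis
    }

    private final Map<String, State> hosts = new HashMap<>();

    /**
     * Records a headerless 429/503 for a host.
     *
     * @return the epoch-millis timestamp until which the host should stay blocked
     */
    long onPressure(String host, long nowMs) {
        State s = hosts.computeIfAbsent(host, k -> new State());

        // Failures of URLs that were already in flight belong to the same
        // incident: do not escalate again inside the current block window.
        if (nowMs < s.blockedUntil) {
            return s.blockedUntil;
        }

        // Assumption: "quiet" is counted from the end of the last block.
        if (s.level > 0 && nowMs - s.blockedUntil >= RESET_MS) {
            s.level = 0;
        }

        long duration = Math.min(CAP_MS, BASE_MS << Math.min(s.level, MAX_SHIFT));
        s.level++;
        s.blockedUntil = nowMs + duration;
        return s.blockedUntil;
    }
}
```

The returned timestamp is what the consumer bolt would hand to `blockQueueUntil`, after conversion to whatever unit the frontier expects. Because a block expires on its own, the only state to maintain is the level and the end of the current window.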
