Re: [I] Send host info on a specific stream [stormcrawler]

via GitHub Thu, 02 Jul 2026 00:30:43 -0700


dpol1 commented on issue #867:
URL: https://github.com/apache/stormcrawler/issues/867#issuecomment-4863272990


   Now that #1973 covers the explicit case (429/503 with a Retry-After 
header), I'd like to tackle the case we probably see most in the wild: servers 
that rate-limit without sending any header at all. Today we just fetch the host 
again and collect another 429.
   
   The idea, kept short:
   
   FetcherBolt already emits on the `hostinfo` stream when there's a header. 
I'd widen that signal with a `kind` field, so a headerless 429/503 becomes a 
"pressure" event. The consumer bolt keeps a small per-host counter and blocks 
the queue in the frontier for a growing duration: 60s the first time, then 2 
minutes, 4, 8, capped at a day. If the host stays quiet for half an hour the 
counter is forgotten and the next incident starts from scratch.
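
A minimal sketch of that schedule, to make the numbers concrete. The class and method names are hypothetical and not part of StormCrawler; only the 60s base, the doubling and the 24h cap come from the proposal above.

```java
import java.time.Duration;

/** Hypothetical helper, for illustration only; not an existing StormCrawler class. */
final class BackoffSchedule {
    static final Duration BASE = Duration.ofSeconds(60);
    static final Duration CAP = Duration.ofHours(24);

    private BackoffSchedule() {}

    /** Block duration for the n-th consecutive incident on a host (n starts at 1). */
    static Duration blockFor(int incident) {
        if (incident < 1) {
            throw new IllegalArgumentException("incident must be >= 1");
        }
        // 60s * 2^(n-1) exceeds 24h from n = 12 onwards, so clamping the
        // exponent at 11 avoids overflow without changing the result.
        int shift = Math.min(incident - 1, 11);
        Duration d = BASE.multipliedBy(1L << shift);
        return d.compareTo(CAP) > 0 ? CAP : d;
    }
}
```

With these constants the cap is reached on the 12th consecutive incident; the 11th still blocks for 61440s, roughly 17 hours.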
   
   Nothing new under the sun: Nutch has done this since 1.19 
([NUTCH-2946](https://issues.apache.org/jira/browse/NUTCH-2946), same doubling 
curve, same 30 minute reset). The difference is where the state lives. Nutch 
keeps it inside the fetcher and later had to patch around fetching going stale 
when only backed-off queues remain 
([NUTCH-3114](https://issues.apache.org/jira/browse/NUTCH-3114)). With the 
state in the frontier a blocked host never reaches the fetcher in the first 
place, so that class of problem can't happen. Workers stay stateless, which is 
also the reason #1944 was closed.
   
   One subtlety worth calling out. When the first 429 arrives there may be a 
dozen URLs of that host already sitting in the fetcher's internal queues. They 
will fail one after the other, and each of those failures must not escalate the 
counter, otherwise a single incident jumps straight to the 24h cap. So: 
escalate at most once per block window. Same trick TCP uses, where multiple 
losses inside one RTT count as a single congestion event.
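
The "once per block window" rule could look roughly like this. It is a sketch under assumptions, not the planned implementation: `HostPressureTracker` and `onPressure` are hypothetical names, time is passed in explicitly to keep the example testable, and the 30 minute quiet period is assumed to be measured from the end of the last block window (the comment above does not pin that down).

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical per-host counter, for illustration only. */
final class HostPressureTracker {
    static final long BASE_MS = 60_000L;                // first block: 60s
    static final long CAP_MS = 24L * 60 * 60 * 1000;    // never block longer than a day
    static final long RESET_MS = 30L * 60 * 1000;       // quiet period that clears the counter

    private static final class State {
        int level;          // number of escalations so far
        long blockedUntil;  // end of the current block window, epoch millis
    }

    private final Map<String, State> hosts = new HashMap<>();

    /**
     * Records a headerless 429/503 for the host at time nowMs and returns
     * the timestamp until which its queue should be blocked.
     */
    long onPressure(String host, long nowMs) {
        State s = hosts.computeIfAbsent(host, h -> new State());

        // Assumption: "quiet" is counted from the end of the last block window.
        if (s.level > 0 && nowMs - s.blockedUntil >= RESET_MS) {
            s.level = 0;
        }

        // Failures from URLs that were already in flight land inside the
        // current window: report the existing deadline, do not escalate.
        if (nowMs < s.blockedUntil) {
            return s.blockedUntil;
        }

        s.level++;
        long duration = Math.min(BASE_MS << Math.min(s.level - 1, 11), CAP_MS);
        s.blockedUntil = nowMs + duration;
        return s.blockedUntil;
    }
}
```

In this sketch a dozen stragglers failing within the first 60 seconds all return the same deadline, so the incident counts once; only a failure after the block has expired moves the host to the next step.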
   
   A couple of questions before I write any code:
   
   1. Enforcement: I'd use `blockQueueUntil` with growing durations. Blocks 
expire on their own, so there is nothing to reset and no way to leave a host 
throttled forever. `setDelay` would be the gentler alternative (host stays 
crawlable, just slower), but it needs an explicit reset and `setDelay(key, 0)` 
also wipes any server-side default delay for that queue. Does anyone feel 
strongly about throttling instead of blocking?
   2. Should the headerless back-off also count fetch exceptions (timeouts, 
connection refused), like Nutch does and #1106 asks for? I'd put that behind a 
config flag, off by default.
   3. The 429'd URLs already sitting in the fetcher's queues: purge them to the 
status stream, or let them fail and reschedule on their own? The same question 
is open on #1973; whatever we decide there applies here too.
   
   If this sounds reasonable I'll put a PR together once #1973 has landed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

