GitHub user rzo1 edited a comment on the discussion: Storm crawler not 
honouring crawl delay

You do have a 15-minute delay configured:

- fetcher.server.delay: 900.0
- fetcher.server.min.delay: 900.0

but that delay is enforced per queue key, not globally per domain. In 
StormCrawler, politeness is guaranteed only if all URLs for a host end up in 
the same FetcherBolt queue. If they don’t, several fetchers will happily fetch 
the same host at the same time, and you’ll see requests much sooner than 
15 minutes apart.
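
As a rough sketch, these are the settings that interact here. The two delay 
values are the ones quoted above; the other keys and values are assumptions 
based on StormCrawler's default configuration and should be checked against 
the `crawler-default.yaml` of your version:

```yaml
config:
  # from the configuration above: 900 seconds between requests in the same queue
  fetcher.server.delay: 900.0
  fetcher.server.min.delay: 900.0
  # assumption: key used to build the fetch queues inside a FetcherBolt
  fetcher.queue.mode: "byHost"
  # assumption: one thread per queue, so requests within a queue are sequential
  fetcher.threads.per.queue: 1
  # key used to route URLs to the FetcherBolt instances
  partition.url.mode: "byHost"
```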

The delay is enforced inside each FetcherBolt instance, and which instance a 
URL reaches is subject to `partition.url.mode: "byHost"`. So you would need to 
check whether the requests you are comparing really go to the same `hostname`. 
To debug what is causing it, you could reduce the parallelism to 1 and check 
the actual queues.
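
A possible debugging setup, written as a Flux topology fragment. The bolt ids 
and the layout are assumptions modelled on the topology generated by the 
StormCrawler archetype, so adapt them to your own topology file:

```yaml
bolts:
  - id: "partitioner"
    className: "org.apache.stormcrawler.bolt.URLPartitionerBolt"
    parallelism: 1
  - id: "fetcher"
    className: "org.apache.stormcrawler.bolt.FetcherBolt"
    # a single instance, so every URL passes through the same set of queues
    parallelism: 1

streams:
  - from: "partitioner"
    to: "fetcher"
    grouping:
      # fields grouping on the partition key keeps all URLs that share a key
      # on the same FetcherBolt instance
      type: FIELDS
      args: ["key"]
```

If the requests are 15 minutes apart with a single instance but not with 
several, the URLs are probably not being routed to the fetchers by host key.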



GitHub link: 
https://github.com/apache/stormcrawler/discussions/1808#discussioncomment-15754826
