GitHub user rzo1 added a comment to the discussion: Storm crawler not honouring crawl delay
You do have a 15-minute delay configured:

- fetcher.server.delay: 900.0
- fetcher.server.min.delay: 900.0

but that delay is enforced per queue key, not globally per domain. In StormCrawler, politeness is guaranteed only if all URLs for a host end up in the same FetcherBolt queue. If they don't, parallel fetchers will happily fetch from the same host at the same time, and you'll see requests much less than 15 minutes apart.

The delay is enforced inside each FetcherBolt instance, and which instance a URL reaches is subject to `partition.url.mode: "byHost"`. So you would need to check whether the URLs you are comparing really share the same `hostname`.

To debug what is causing it, you could reduce the parallelism to 1 and check the actual queues.

GitHub link: https://github.com/apache/stormcrawler/discussions/1808#discussioncomment-15754826
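
To illustrate why a per-instance delay does not add up to a per-host delay, here is a minimal toy simulation in plain Python. It does not use any StormCrawler or Storm API: the functions `simulate`, `min_gap`, `by_host`, and `round_robin` are hypothetical names made up for this sketch, and the routing functions are only stand-ins for how URLs might be distributed across fetcher instances. It assumes every URL is ready at time 0 and that each fetcher only remembers its own last request per host.

```python
from collections import defaultdict

DELAY = 900.0  # seconds; mirrors the 900.0 configured in the comment above


def simulate(urls, num_fetchers, route):
    """Return the sorted fetch times per host.

    Each simulated fetcher enforces DELAY only against its own history,
    which is the behaviour described in the comment (assumption: all
    URLs are available at time 0).
    """
    # next_free[i][host] = earliest time fetcher i may contact host again
    next_free = [defaultdict(float) for _ in range(num_fetchers)]
    times = defaultdict(list)
    for index, (host, _path) in enumerate(urls):
        fetcher = route(index, host) % num_fetchers
        t = next_free[fetcher][host]
        times[host].append(t)
        next_free[fetcher][host] = t + DELAY
    return {host: sorted(ts) for host, ts in times.items()}


def min_gap(times):
    """Smallest interval between two consecutive requests to one host."""
    return min(b - a for ts in times.values() for a, b in zip(ts, ts[1:]))


def by_host(index, host):
    # Hypothetical stand-in for host-based partitioning: same host, same fetcher.
    return sum(host.encode())


def round_robin(index, host):
    # Hypothetical stand-in for routing that ignores the host.
    return index


urls = [("example.org", f"/page{i}") for i in range(4)]

print(min_gap(simulate(urls, 2, by_host)))      # 900.0
print(min_gap(simulate(urls, 2, round_robin)))  # 0.0
print(min_gap(simulate(urls, 1, round_robin)))  # 900.0
```

The last call corresponds to the debugging suggestion: with a single fetcher, the routing no longer matters and the full delay is observed, so a shorter gap at higher parallelism points at how URLs are being distributed.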
