GitHub user rzo1 added a comment to the discussion: Storm crawler not honouring crawl delay
`byDomain` relies on the effective TLD finder of crawler-commons, which uses the [PSL](https://github.com/publicsuffix/list/blob/main/public_suffix_list.dat) (Public Suffix List) as its data source. This is the same list maintained at https://publicsuffix.org/, which browsers and libraries use to determine which parts of a domain are publicly registrable and which are under registry control. So if a domain is not covered by the PSL (or only partly covered), the delay might not be enforced as expected.

- https://github.com/crawler-commons/crawler-commons/blob/master/src/main/java/crawlercommons/domains/PaidLevelDomain.java
- https://github.com/crawler-commons/crawler-commons/blob/master/src/main/java/crawlercommons/domains/EffectiveTldFinder.java

GitHub link: https://github.com/apache/stormcrawler/discussions/1808#discussioncomment-15764944
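
To illustrate why PSL coverage matters for the delay, here is a minimal Java sketch of PSL-style grouping. It is not the crawler-commons code: the class `PslSketch`, the method `queueKey`, the two-entry suffix set and the fallback to the full host name are all assumptions made for illustration, and the PSL's wildcard and exception rules are ignored.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Toy illustration only: NOT the crawler-commons implementation.
final class PslSketch {

    // Hypothetical helper: derive the key under which hosts would share one delay.
    static String queueKey(String host, Set<String> suffixes) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Try the longest candidate suffix first so the longest matching rule wins.
        for (int i = 1; i < labels.length; i++) {
            String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (suffixes.contains(candidate)) {
                // Registrable domain = matched suffix plus one label to its left.
                return labels[i - 1] + "." + candidate;
            }
        }
        // Assumption of this sketch: with no matching rule, keep the full host name.
        return String.join(".", labels);
    }

    public static void main(String[] args) {
        // Tiny stand-in for the real list (assumed entries).
        Set<String> suffixes = Set.of("com", "co.uk");

        // Covered suffix: both hosts map to the same key and would share a delay.
        System.out.println(queueKey("a.example.co.uk", suffixes)); // example.co.uk
        System.out.println(queueKey("b.example.co.uk", suffixes)); // example.co.uk

        // Suffix not in the list: each host keeps its own key in this sketch.
        System.out.println(queueKey("a.shop.internal", suffixes)); // a.shop.internal
        System.out.println(queueKey("b.shop.internal", suffixes)); // b.shop.internal
    }
}
```

In this sketch the two `.internal` hosts end up under different keys, so a per-key delay would not be shared between them. What the real finder does for suffixes missing from the list may differ, so check `EffectiveTldFinder` directly.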
