GitHub user rzo1 added a comment to the discussion: Storm crawler not honouring 
crawl delay

`byDomain` relies on the effective TLD finder of crawler-commons, which uses 
the 
[PSL](https://github.com/publicsuffix/list/blob/main/public_suffix_list.dat) 
as its data source. This is the list maintained at https://publicsuffix.org/, 
used by browsers and libraries to know which parts of a domain are publicly 
registrable versus under registry control. So if a domain's suffix is not 
contained in the PSL (or only parts of it are), the delay might not be 
enforced as expected.

- https://github.com/crawler-commons/crawler-commons/blob/master/src/main/java/crawlercommons/domains/PaidLevelDomain.java
- https://github.com/crawler-commons/crawler-commons/blob/master/src/main/java/crawlercommons/domains/EffectiveTldFinder.java
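
To illustrate why PSL coverage matters for grouping hosts, here is a minimal, self-contained sketch. It is not the crawler-commons implementation: the class `PslGroupingSketch`, the method `registrableDomain` and the tiny suffix sets are all hypothetical, and the real PSL also has wildcard and exception rules that are ignored here. It only shows the general idea of "longest matching public suffix plus one label".

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: NOT the crawler-commons code.
class PslGroupingSketch {

    // Hypothetical, tiny stand-in for the Public Suffix List.
    static final Set<String> TOY_SUFFIXES =
            new HashSet<>(Arrays.asList("com", "uk", "co.uk"));

    // Joins labels[from..end] back into a dotted name.
    static String join(String[] labels, int from) {
        return String.join(".", Arrays.copyOfRange(labels, from, labels.length));
    }

    // Returns the longest matching suffix plus one more label to its left.
    // If no suffix matches, the lower-cased host is returned unchanged.
    static String registrableDomain(String host, Set<String> suffixes) {
        String[] labels = host.toLowerCase().split("\\.");
        // Candidates are tried from longest to shortest, so "co.uk" wins over "uk".
        for (int i = 1; i < labels.length; i++) {
            if (suffixes.contains(join(labels, i))) {
                return join(labels, i - 1);
            }
        }
        return host.toLowerCase();
    }

    public static void main(String[] args) {
        // Suffix "co.uk" is known: both hosts share one grouping key.
        System.out.println(registrableDomain("www.example.co.uk", TOY_SUFFIXES));
        System.out.println(registrableDomain("shop.example.co.uk", TOY_SUFFIXES));

        // Suffix "co.uk" is missing: unrelated sites collapse under the same key.
        Set<String> incomplete = new HashSet<>(Arrays.asList("com", "uk"));
        System.out.println(registrableDomain("www.example.co.uk", incomplete));
        System.out.println(registrableDomain("www.other.co.uk", incomplete));
    }
}
```

With the full toy list, both `example.co.uk` hosts map to the same key. With `co.uk` removed, unrelated sites end up under the key `co.uk`. Any per-domain setting keyed this way would then apply to a different set of hosts than intended.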

GitHub link: 
https://github.com/apache/stormcrawler/discussions/1808#discussioncomment-15764944
