sebastian-nagel opened a new pull request, #816:
URL: https://github.com/apache/nutch/pull/816
and NUTCH-1942 Remove TopLevelDomain
- use methods from crawler-commons' EffectiveTldFinder in URLUtil, replacing
classes and methods from the "org.apache.nutch.util.domain" package
- adapt and extend unit tests
- add tests for URLUtil.getTopLevelDomainName(url)
- reflect changes to the public suffix list since 2014 ("xyz" is now a
public suffix / ICANN suffix)
- adapt to minor API changes
- URLUtil.getDomainName(url) returns the host name in case no valid
public suffix is found
- for Unicode suffixes and TLDs, the methods
URLUtil.getDomainSuffix(url) and URLUtil.getTopLevelDomainName(url),
respectively, now return the ASCII representation
- add unit tests for host names with trailing dot ("www.apache.org.")
- add unit test for URLs without host/domain (cf. NUTCH-2450)
- update and complete Javadoc
- update DomainStatistics, TLDIndexingFilter and domain URL filters to use
the updated methods in URLUtil
- remove the class TLDScoringFilter: its configuration is bound to
domain-suffixes.xml, which is no longer maintained and is now removed
- remove package org.apache.nutch.util.domain
- move DomainStatistics to org.apache.nutch.util
- remove configuration files of domain utils
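
To illustrate the behavior changes listed above (fallback to the host name when no valid public suffix is found, ASCII representation for Unicode suffixes, host names with a trailing dot, missing host), here is a minimal, self-contained sketch. It is not the code in this pull request and does not call crawler-commons: the class `DomainNameSketch`, its methods, and the four-entry suffix set are hypothetical stand-ins for URLUtil and the public suffix list. Stripping the trailing dot is an assumption of the sketch, not a statement about what URLUtil returns.

```java
import java.net.IDN;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

public class DomainNameSketch {

  // Hypothetical stand-in for the public suffix list; the real list is far larger.
  private static final Set<String> SUFFIXES = Set.of("org", "xyz", "co.uk", "xn--p1ai");

  // Lowercase, drop one trailing dot (assumption), convert Unicode labels to ASCII (Punycode).
  static String normalize(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    if (h.endsWith(".")) {
      h = h.substring(0, h.length() - 1);
    }
    return IDN.toASCII(h);
  }

  // Longest known suffix that leaves at least one label in front of it, or null.
  static String domainSuffix(String host) {
    if (host == null || host.isEmpty()) {
      return null;
    }
    String[] labels = normalize(host).split("\\.");
    for (int i = 1; i < labels.length; i++) {
      String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
      if (SUFFIXES.contains(candidate)) {
        return candidate;
      }
    }
    return null;
  }

  // Suffix plus the label directly before it; the whole host if no suffix matches.
  static String domainName(String host) {
    if (host == null || host.isEmpty()) {
      return null;
    }
    String h = normalize(host);
    String suffix = domainSuffix(h);
    if (suffix == null) {
      return h; // no valid public suffix: fall back to the host name
    }
    String front = h.substring(0, h.length() - suffix.length() - 1);
    return front.substring(front.lastIndexOf('.') + 1) + "." + suffix;
  }
}
```

In this sketch, `domainName("www.apache.org.")` yields `apache.org`, `domainName("intranet.invalidtld")` yields the host unchanged, and `domainSuffix("example.\u0440\u0444")` yields the ASCII form `xn--p1ai`.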