This is an automated email from the ASF dual-hosted git repository.
snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git
from 582cdd417 NUTCH-3058 Fetcher: counter for hung threads (#820)
add f6bcec920 NUTCH-1806 Delegate processing of URL domains to crawler
commons - add unit test for URLs without host/domain (cf. NUTCH-2450)
add bc2ae7e0c NUTCH-1806 Delegate processing of URL domains to crawler
commons - add unit tests for host names with trailing dot ("www.apache.org.")
add e0fa35729 NUTCH-1806 Delegate processing of URL domains to crawler
commons - use methods from crawler-commons' EffectiveTldFinder in URLUtil
replacing classes and methods from the org.apache.nutch.util.domain package -
adapt and extend unit tests - add tests for
URLUtil.getTopLevelDomainName(url) - changes to the public suffix list since
2014 ("xyz" is now a public suffix / ICANN suffix) - minor API changes
- URLUtil.getDomainName(url) returns the host name [...]
add d43f5793f NUTCH-1806 Delegate processing of URL domains to crawler
commons NUTCH-1942 Remove TopLevelDomain - update DomainStatistics,
TLDIndexingFilter and domain URL filters to use the updated methods in
URLUtil - remove TLDScoringFilter - remove package org.apache.nutch.util.domain
- move DomainStatistics to org.apache.nutch.util - remove configuration files
of domain utils
add 40881e8b7 NUTCH-1806 Delegate processing of URL domains to crawler
commons
new 8b11962a4 Merge pull request #816 from
sebastian-nagel/NUTCH-1942-domain-utils-to-use-crawler-commons
The 1 revision listed above as "new" is entirely new to this
repository and will be described in a separate email. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
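
For readers unfamiliar with the suffix-based domain lookup that the commits
above delegate to crawler-commons, the idea can be sketched roughly as follows.
This is a minimal illustration only, not the crawler-commons or Nutch code:
the class DomainSketch, the method domainName and the four-entry SUFFIXES set
are hypothetical stand-ins (the real public suffix list is far larger), and
the fallback to the host name when no suffix matches is an assumption of this
sketch.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

/** Illustrative sketch only; not the crawler-commons or Nutch implementation. */
final class DomainSketch {

  // Hypothetical, tiny stand-in for the public suffix list.
  private static final Set<String> SUFFIXES = Set.of("org", "com", "xyz", "co.uk");

  /**
   * Returns the public suffix plus one preceding label for the given host,
   * or the normalized host itself if no such domain can be determined
   * (assumption of this sketch).
   */
  static String domainName(String host) {
    String h = host.toLowerCase(Locale.ROOT);
    // Strip a single trailing dot, as in "www.apache.org."
    if (h.endsWith(".")) {
      h = h.substring(0, h.length() - 1);
    }
    String[] labels = h.split("\\.");
    // Try candidate suffixes from longest to shortest, always leaving
    // at least one label in front of the suffix.
    for (int i = 1; i < labels.length; i++) {
      String candidate = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
      if (SUFFIXES.contains(candidate)) {
        return String.join(".", Arrays.copyOfRange(labels, i - 1, labels.length));
      }
    }
    return h;
  }

  public static void main(String[] args) {
    System.out.println(domainName("www.apache.org."));
    System.out.println(domainName("www.example.co.uk"));
    System.out.println(domainName("localhost"));
  }
}
```

Trying the longest candidate first means a multi-label suffix such as "co.uk"
matches before any shorter suffix could, and a host that is itself only a
suffix (for example "xyz") falls through to the host-name fallback.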
Summary of changes:
conf/domain-suffixes.xml.template | 4428 --------------------
conf/domain-suffixes.xsd | 130 -
default.properties | 1 -
src/bin/nutch | 2 +-
.../nutch/util/{domain => }/DomainStatistics.java | 7 +-
src/java/org/apache/nutch/util/URLUtil.java | 214 +-
.../org/apache/nutch/util/domain/DomainSuffix.java | 78 -
.../apache/nutch/util/domain/DomainSuffixes.java | 91 -
.../nutch/util/domain/DomainSuffixesReader.java | 164 -
.../apache/nutch/util/domain/TopLevelDomain.java | 66 -
.../org/apache/nutch/util/domain/package-info.java | 28 -
.../nutch/indexer/tld/TLDIndexingFilter.java | 13 +-
.../apache/nutch/scoring/tld/TLDScoringFilter.java | 60 -
.../org/apache/nutch/scoring/tld/package-info.java | 19 -
.../nutch/urlfilter/domain/DomainURLFilter.java | 9 +-
.../domaindenylist/DomainDenylistURLFilter.java | 9 +-
src/test/org/apache/nutch/util/TestURLUtil.java | 81 +-
17 files changed, 208 insertions(+), 5192 deletions(-)
delete mode 100644 conf/domain-suffixes.xml.template
delete mode 100644 conf/domain-suffixes.xsd
rename src/java/org/apache/nutch/util/{domain => }/DomainStatistics.java (97%)
delete mode 100644 src/java/org/apache/nutch/util/domain/DomainSuffix.java
delete mode 100644 src/java/org/apache/nutch/util/domain/DomainSuffixes.java
delete mode 100644 src/java/org/apache/nutch/util/domain/DomainSuffixesReader.java
delete mode 100644 src/java/org/apache/nutch/util/domain/TopLevelDomain.java
delete mode 100644 src/java/org/apache/nutch/util/domain/package-info.java
delete mode 100644 src/plugin/tld/src/java/org/apache/nutch/scoring/tld/TLDScoringFilter.java
delete mode 100644 src/plugin/tld/src/java/org/apache/nutch/scoring/tld/package-info.java