This is an automated email from the ASF dual-hosted git repository.
snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.
from 2837039 NUTCH-2855 Update org.elasticsearch.client (#577)
new c454a64 NUTCH-2858 urlnormalizer-protocol: URL port is lost during normalization
  - if URL includes a port the protocol is not normalized
  - add unit tests to verify correct behavior
new d749920 NUTCH-2858 urlnormalizer-protocol: URL port is lost during normalization
  - add note in config file that URLs including port are left unchanged
new 081c826 NUTCH-2859: urlnormalizer-protocol: allow to normalize domains
  - host names starting with `*.` are matched as suffixes: `*.example.org` matches `example.org`, `www.example.org`, `www.subdomain.example.org`, etc.
  - allow to read config file protocols.txt from hdfs:// or any file system supported by Hadoop
  - add Javadoc package documentation
  - document configuration properties in nutch-default.xml
  - reduce memory footprint by deduplicating protocol strings so that [...]
new 6c02da0 Merge pull request #576 from
sebastian-nagel/NUTCH-2859-urlnormalizer-protocol-domain-rules
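
The two behaviors described in the commit messages above (a `*.` host pattern matching a domain and all of its subdomains, and URLs with an explicit port being left unchanged) can be illustrated with a minimal sketch. This is not the code from ProtocolURLNormalizer.java: the class `ProtocolRuleSketch`, its methods `addRule`, `hostMatches`, and `normalize`, and the in-memory rule map are hypothetical names chosen for illustration only.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative sketch only; names and data structures are assumptions,
// not the Nutch plugin's actual implementation.
public class ProtocolRuleSketch {

    // Hypothetical rule table: host pattern -> protocol to enforce.
    private final Map<String, String> rules = new LinkedHashMap<>();

    public void addRule(String hostPattern, String protocol) {
        rules.put(hostPattern.toLowerCase(Locale.ROOT), protocol);
    }

    // A pattern starting with "*." matches the bare domain and any subdomain;
    // any other pattern must equal the host exactly.
    public static boolean hostMatches(String pattern, String host) {
        if (!pattern.startsWith("*.")) {
            return host.equals(pattern);
        }
        String domain = pattern.substring(2);
        return host.equals(domain) || host.endsWith("." + domain);
    }

    public String normalize(String url) {
        URI uri = URI.create(url);
        String scheme = uri.getScheme();
        String host = uri.getHost();
        // URLs with an explicit port (getPort() != -1) are returned unchanged.
        if (scheme == null || host == null || uri.getPort() != -1) {
            return url;
        }
        String lowerHost = host.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> rule : rules.entrySet()) {
            if (hostMatches(rule.getKey(), lowerHost)) {
                // Swap only the scheme; keep the rest of the URL as is.
                return rule.getValue() + url.substring(scheme.length());
            }
        }
        return url;
    }

    public static void main(String[] args) {
        ProtocolRuleSketch n = new ProtocolRuleSketch();
        n.addRule("*.example.org", "https");
        // prints https://www.subdomain.example.org/a
        System.out.println(n.normalize("http://www.subdomain.example.org/a"));
        // prints http://example.org:8080/a (port present, left unchanged)
        System.out.println(n.normalize("http://example.org:8080/a"));
        // prints http://badexample.org/ (not a subdomain of example.org)
        System.out.println(n.normalize("http://badexample.org/"));
    }
}
```

The suffix check compares against `"." + domain` rather than the bare domain, so that `badexample.org` is not mistaken for a subdomain of `example.org`.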
The 3203 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
Summary of changes:
 conf/nutch-default.xml                           |  15 +++
 conf/protocols.txt.template                      |   6 ++
 .../basic/TestBasicURLNormalizer.java            |   2 +
 .../urlnormalizer-protocol/data/protocols.txt    |  16 ++-
 .../protocol/ProtocolURLNormalizer.java          | 115 ++++++++++++++-------
 .../net/urlnormalizer/protocol/package-info.java |  55 ++++++++++
 .../protocol/TestProtocolURLNormalizer.java      |  53 ++++++++--
7 files changed, 215 insertions(+), 47 deletions(-)
 create mode 100644 src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/package-info.java