This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.

    from 2837039  NUTCH-2855 Update org.elasticsearch.client (#577)
     new c454a64  NUTCH-2858 urlnormalizer-protocol: URL port is lost during normalization - if URL includes a port the protocol is not normalized - add unit tests to verify correct behavior
     new d749920  NUTCH-2858 urlnormalizer-protocol: URL port is lost during normalization - add note in config file that URLs including port are not left unchanged
     new 081c826  NUTCH-2859: urlnormalizer-protocol: allow to normalize domains - host names starting with `*.` are matched as suffixes: `*.example.org` matches `example.org`, `www.example.org`, `www.subdomain.example.org`, etc. - allow to read config file protocols.txt from hdfs:// or any file system supported by Hadoop - add Javadoc package documentation - document configuration properties in nutch-default.xml - reduce memory footprint by deduplicating protocol strings so that [...]
     new 6c02da0  Merge pull request #576 from sebastian-nagel/NUTCH-2859-urlnormalizer-protocol-domain-rules

The 3203 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 conf/nutch-default.xml                             |  15 +++
 conf/protocols.txt.template                        |   6 ++
 .../basic/TestBasicURLNormalizer.java              |   2 +
 .../urlnormalizer-protocol/data/protocols.txt      |  16 ++-
 .../protocol/ProtocolURLNormalizer.java            | 115 ++++++++++++++-------
 .../net/urlnormalizer/protocol/package-info.java   |  55 ++++++++++
 .../protocol/TestProtocolURLNormalizer.java        |  53 ++++++++--
 7 files changed, 215 insertions(+), 47 deletions(-)
 create mode 100644 src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/package-info.java
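
Note on the `*.` host rule from NUTCH-2859: the suffix matching described in the commit message can be illustrated with a minimal sketch. This is not the code in ProtocolURLNormalizer.java. The class name `HostSuffixMatchSketch` and the method `matchesHost` are hypothetical, and the sketch only assumes the behavior stated above, namely that `*.example.org` matches `example.org` itself and any host below it.

```java
import java.util.Locale;

// Illustrative sketch only: NOT the actual ProtocolURLNormalizer code.
// Class and method names are hypothetical.
public class HostSuffixMatchSketch {

    /**
     * Returns true if the host matches the pattern. A pattern starting
     * with "*." is treated as a domain suffix and also matches the bare
     * domain; any other pattern must equal the host exactly.
     */
    public static boolean matchesHost(String pattern, String host) {
        String p = pattern.toLowerCase(Locale.ROOT);
        String h = host.toLowerCase(Locale.ROOT);
        if (p.startsWith("*.")) {
            String domain = p.substring(2); // e.g. "example.org"
            // Require a dot boundary so "badexample.org" does not match.
            return h.equals(domain) || h.endsWith("." + domain);
        }
        return h.equals(p);
    }

    public static void main(String[] args) {
        String rule = "*.example.org";
        System.out.println(matchesHost(rule, "example.org"));
        System.out.println(matchesHost(rule, "www.subdomain.example.org"));
        System.out.println(matchesHost(rule, "badexample.org"));
    }
}
```

Checking for the dot boundary rather than a plain `endsWith(domain)` keeps unrelated hosts that merely share trailing characters from being matched.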