This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git.


    from 2837039  NUTCH-2855 Update org.elasticsearch.client (#577)
     new c454a64  NUTCH-2858 urlnormalizer-protocol: URL port is lost during normalization - if URL includes a port the protocol is not normalized - add unit tests to verify correct behavior
     new d749920  NUTCH-2858 urlnormalizer-protocol: URL port is lost during normalization - add note in config file that URLs including port are not left unchanged
     new 081c826  NUTCH-2859: urlnormalizer-protocol: allow to normalize domains - host names starting with `*.` are matched as suffixes: `*.example.org` matches `example.org`, `www.example.org`, `www.subdomain.example.org`, etc. - allow to read config file protocols.txt from hdfs:// or any file system supported by Hadoop - add Javadoc package documentation - document configuration properties in nutch-default.xml - reduce memory footprint by deduplicating protocol strings so that [...]
     new 6c02da0  Merge pull request #576 from sebastian-nagel/NUTCH-2859-urlnormalizer-protocol-domain-rules
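
The suffix matching described for NUTCH-2859 (`*.example.org` matches `example.org`, `www.example.org`, `www.subdomain.example.org`, etc.) could be sketched roughly as below. This is only an illustration and not the code from ProtocolURLNormalizer.java: the class name `DomainSuffixMatch`, the method `lookupProtocol`, and the two rule maps are hypothetical, and the precedence of exact host rules over domain rules is an assumption.

```java
import java.util.Map;

// Hypothetical sketch, not the Nutch implementation.
class DomainSuffixMatch {

  /**
   * Returns the protocol configured for a host, or null if no rule applies.
   *
   * hostRules:   exact host name -> protocol (assumed to take precedence)
   * domainRules: domain -> protocol, where a rule written as "*.example.org"
   *              is assumed to be stored under the key "example.org"
   */
  static String lookupProtocol(Map<String, String> hostRules,
                               Map<String, String> domainRules,
                               String host) {
    String protocol = hostRules.get(host);
    if (protocol != null) {
      return protocol;
    }
    // Try the host itself, then strip the leading label one at a time:
    // www.sub.example.org -> sub.example.org -> example.org -> org
    String domain = host;
    while (true) {
      protocol = domainRules.get(domain);
      if (protocol != null) {
        return protocol;
      }
      int dot = domain.indexOf('.');
      if (dot < 0) {
        return null; // no more parent domains to try
      }
      domain = domain.substring(dot + 1);
    }
  }
}
```

Because the lookup strips whole labels rather than comparing raw string suffixes, a host such as `notexample.org` is not matched by a rule for `*.example.org`.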

The 4 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 conf/nutch-default.xml                             |  15 +++
 conf/protocols.txt.template                        |   6 ++
 .../basic/TestBasicURLNormalizer.java              |   2 +
 .../urlnormalizer-protocol/data/protocols.txt      |  16 ++-
 .../protocol/ProtocolURLNormalizer.java            | 115 ++++++++++++++-------
 .../net/urlnormalizer/protocol/package-info.java   |  55 ++++++++++
 .../protocol/TestProtocolURLNormalizer.java        |  53 ++++++++--
 7 files changed, 215 insertions(+), 47 deletions(-)
 create mode 100644 src/plugin/urlnormalizer-protocol/src/java/org/apache/nutch/net/urlnormalizer/protocol/package-info.java
