sebastian-nagel opened a new pull request #576:
URL: https://github.com/apache/nutch/pull/576


   (note: to avoid merge conflicts, this PR includes #575)
   
   To address NUTCH-2859, host names starting with `*.` (in the config 
file `protocols.txt` or in the string rules) are now matched as suffixes: 
`*.example.org` matches `example.org`, `www.example.org`, 
`www.subdomain.example.org`, etc.
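
The suffix rule described above can be illustrated with a minimal sketch. This is not the code from the PR: the class and method names are hypothetical, and two behaviours are assumptions rather than statements from the description, namely that matching is case-insensitive and that a suffix only matches on a label boundary (so `badexample.org` would not match `*.example.org`). Entries without a leading `*.` are assumed to match the exact host only.

```java
import java.util.Locale;

// Illustrative sketch only; names are hypothetical and not taken from Nutch.
final class HostSuffixSketch {

    /**
     * Returns true if the host matches the pattern. A pattern starting with
     * "*." is treated as a domain suffix; any other pattern must equal the host.
     */
    static boolean matches(String pattern, String host) {
        // Assumption: host names are compared case-insensitively.
        String p = pattern.toLowerCase(Locale.ROOT);
        String h = host.toLowerCase(Locale.ROOT);
        if (!p.startsWith("*.")) {
            return h.equals(p); // plain entry: exact host match
        }
        String domain = p.substring(2); // "*.example.org" -> "example.org"
        // Match the bare domain itself, or any host ending in ".<domain>".
        // Requiring the leading dot keeps the match on a label boundary.
        return h.equals(domain) || h.endsWith("." + domain);
    }
}
```

With this sketch, `HostSuffixSketch.matches("*.example.org", "www.subdomain.example.org")` returns true, while the same call with the plain pattern `example.org` returns false.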
   
   Additional improvements:
   - allow reading the config file `protocols.txt` from hdfs:// or any other 
file system supported by Hadoop, which is useful if the list of hosts or 
domains requiring normalization is large or changes often
   - add Javadoc package documentation
   - document the configuration properties in nutch-default.xml
   - reduce the memory footprint by deduplicating protocol strings, so that 
equal protocol values are references to the same object
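
As a rough illustration of the deduplication idea in the last item, the sketch below keeps one canonical `String` instance per distinct protocol value, so that a large host-to-protocol map holds many references to a few objects. It is an assumption about how such deduplication could be done, not the PR's implementation; `ProtocolDedupSketch` and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names are hypothetical and not taken from Nutch.
final class ProtocolDedupSketch {

    // One canonical instance per distinct protocol value ("http", "https", ...).
    private final Map<String, String> canonical = new HashMap<>();

    // Host (or domain) -> protocol; values are shared canonical instances.
    private final Map<String, String> hostToProtocol = new HashMap<>();

    /** Registers a host with its protocol, reusing an existing equal String if present. */
    void put(String host, String protocol) {
        // computeIfAbsent returns the already stored instance for an equal key,
        // or stores and returns this one if it is the first of its value.
        String shared = canonical.computeIfAbsent(protocol, p -> p);
        hostToProtocol.put(host, shared);
    }

    /** Returns the protocol registered for the host, or null if there is none. */
    String get(String host) {
        return hostToProtocol.get(host);
    }
}
```

Lines parsed from a file typically yield a fresh `String` per entry, so without a step like this each host would carry its own copy of `"https"`. `String.intern()` would be an alternative; a local map keeps the canonical instances scoped to the object that owns them.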

