This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git

    from 02dca3b6d NUTCH-2936 Early registration of URL stream handlers provided by plugins may fail Hadoop jobs running in distributed mode (#726)
     new 70b2d5e55 NUTCH-2950 Improve performance of UpdateHostDb
                   - avoid needless conversion between host name and URL and back if -filter and -normalize are off
                   - URLUtil: use ROOT locale when converting host name / URL to lowercase
     new 5a6ac3bb1 NUTCH-2950 Improve performance of UpdateHostDb
                   - be lazy creating HostDatum metaData objects:
                     - do not create MapWritable object unless needed
                     - use clear() instead of constructing new object when reading metadata from sequence file
                     - use a static serialization of the empty metaData MapWritable, as empty HostDatum metadata is the most common case (stay backward compatible by keeping metadata mandatory)
     new 417dee6d1 NUTCH-2950 Improve performance of UpdateHostDb
                   - simplify map function:
                     - remove instanceof conditions for key (it's an instance of Text by method signature)
                     - avoid parsing the URL string multiple times
     new 13f8504a3 Improve performance of UpdateHostDb
                   - parameterize logging
                   - set the logging level to DEBUG for information which is later found in the HostDb itself (avoids frequent log messages flooding the log files)
                   - if DNS look-ups are not enabled (no -check* options passed):
                     - do not count and log the hosts skipped as not yet eligible for DNS look-ups
                     - do not create DNS resolver threads
     new 5086958be NUTCH-2950 Improve performance of UpdateHostDb
                   - only create the homepage string if needed, rely on the parsed URL to select a URL as homepage candidate
     new bafa752f7 Fail javadoc build on all kinds of javadoc errors and warnings independent from system settings
     new 947e67bef NUTCH-2950 Improve performance of UpdateHostDb
                   - fix Javadoc errors / warnings
     new 47d3fe607 Merge pull request #731 from sebastian-nagel/NUTCH-2950-update-hostdb-performance

The 3292 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.

Summary of changes:
 build.xml                                          |  1 +
 src/java/org/apache/nutch/crawl/Generator.java     | 38 ++++-----
 src/java/org/apache/nutch/hostdb/HostDatum.java    | 66 +++++++++++-----
 src/java/org/apache/nutch/hostdb/ReadHostDb.java   | 29 +++----
 .../apache/nutch/hostdb/UpdateHostDbMapper.java    | 89 ++++++++++++++--------
 .../apache/nutch/hostdb/UpdateHostDbReducer.java   | 26 ++++---
 src/java/org/apache/nutch/util/URLUtil.java        | 41 +++++++++-
 7 files changed, 198 insertions(+), 92 deletions(-)
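
Note on the "ROOT locale" item in 70b2d5e55: the sketch below is not taken from URLUtil; it only illustrates why a locale-independent lowercase conversion matters for host names. The class name HostNameCase and the methods lowerRoot and lowerIn are hypothetical names chosen for this illustration. Under a Turkish locale, String.toLowerCase maps "I" to the dotless "ı" (U+0131), which would turn an ASCII host name into a different string.

```java
import java.util.Locale;

// Illustration only: hypothetical helper, not the code in Nutch's URLUtil.
public class HostNameCase {

  /** Lowercases a host name independently of the JVM's default locale. */
  static String lowerRoot(String host) {
    return host.toLowerCase(Locale.ROOT);
  }

  /** Lowercases with an explicit locale, to show the locale-sensitive behavior. */
  static String lowerIn(String host, Locale locale) {
    return host.toLowerCase(locale);
  }

  public static void main(String[] args) {
    String host = "WIKI.Example.COM";
    Locale turkish = Locale.forLanguageTag("tr");

    // Locale.ROOT yields plain ASCII lowercase.
    System.out.println(lowerRoot(host));

    // With the Turkish locale each 'I' becomes U+0131 (dotless i),
    // so the result no longer equals "wiki.example.com".
    System.out.println(lowerIn(host, turkish).equals(lowerRoot(host)));
  }
}
```

Passing Locale.ROOT explicitly keeps the result the same regardless of the default locale of the machine running the job.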