This is an automated email from the ASF dual-hosted git repository.
snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git
from 02dca3b6d NUTCH-2936 Early registration of URL stream handlers
provided by plugins may fail Hadoop jobs running in distributed mode (#726)
new 70b2d5e55 NUTCH-2950 Improve performance of UpdateHostDb
  - avoid needless conversion between host name and URL and back if
    -filter and -normalize are off
  - URLUtil: use ROOT locale when converting host name / URL to lowercase
new 5a6ac3bb1 NUTCH-2950 Improve performance of UpdateHostDb
  - be lazy creating HostDatum metaData objects:
    - do not create MapWritable object unless needed
    - use clear() instead of constructing a new object when reading
      metadata from sequence file
    - use a statically serialized empty metaData MapWritable, as empty
      HostDatum metadata is the most common case (stay backward
      compatible by keeping metadata mandatory)
new 417dee6d1 NUTCH-2950 Improve performance of UpdateHostDb
  - simplify map function:
    - remove instanceof conditions for key (it's an instance of Text by
      method signature)
    - avoid parsing the URL string multiple times
new 13f8504a3 Improve performance of UpdateHostDb
  - parameterize logging
  - set logging level of information which is later found in the HostDb
    itself to DEBUG (so that frequent log messages do not flood the log
    files)
  - if DNS look-ups are not enabled (no -check* options passed):
    - do not count and log the hosts skipped because they are not yet
      eligible for DNS look-ups
    - do not create DNS resolver threads
new 5086958be NUTCH-2950 Improve performance of UpdateHostDb
  - only create the homepage string if needed, rely on the parsed URL to
    select a URL as homepage candidate
new bafa752f7 Fail javadoc build on all kinds of javadoc errors and
warnings independent from system settings
new 947e67bef NUTCH-2950 Improve performance of UpdateHostDb
  - fix Javadoc errors / warnings
new 47d3fe607 Merge pull request #731 from
sebastian-nagel/NUTCH-2950-update-hostdb-performance
The 3292 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails. The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
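
For readers unfamiliar with the locale issue mentioned for URLUtil, the
following is a minimal sketch of why a fixed ROOT locale matters when
lowercasing host names. The class and method names are hypothetical and are
not taken from the Nutch sources; only java.util.Locale and
String.toLowerCase(Locale) are standard JDK API.

```java
import java.util.Locale;

/** Hypothetical illustration; not the actual URLUtil code. */
class HostNameLowercaseSketch {

  /** Lowercases a host name independently of the JVM's default locale. */
  static String toLowerCaseHost(String host) {
    return host.toLowerCase(Locale.ROOT);
  }

  /** Lowercases with an explicit locale, to show the locale-sensitive behavior. */
  static String toLowerCaseWithLocale(String host, Locale locale) {
    return host.toLowerCase(locale);
  }

  public static void main(String[] args) {
    String host = "TITLE.EXAMPLE.COM";
    // With Turkish rules, 'I' is lowercased to the dotless 'ı' (U+0131),
    // which yields a different host name.
    System.out.println(toLowerCaseWithLocale(host, Locale.forLanguageTag("tr")));
    // With the ROOT locale the result does not depend on system settings.
    System.out.println(toLowerCaseHost(host));
  }
}
```

On a JVM whose default locale is Turkish, a plain toLowerCase() call would
behave like the first call above, which is presumably the kind of
system-dependent result the commit is guarding against.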
Summary of changes:
build.xml                                        |  1 +
src/java/org/apache/nutch/crawl/Generator.java   | 38 ++++-----
src/java/org/apache/nutch/hostdb/HostDatum.java  | 66 +++++++++++-----
src/java/org/apache/nutch/hostdb/ReadHostDb.java | 29 +++----
.../apache/nutch/hostdb/UpdateHostDbMapper.java  | 89 ++++++++++++++--------
.../apache/nutch/hostdb/UpdateHostDbReducer.java | 26 ++++---
src/java/org/apache/nutch/util/URLUtil.java      | 41 +++++++++-
7 files changed, 198 insertions(+), 92 deletions(-)
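
The lazy-metadata idea described for HostDatum can be sketched roughly as
follows. This is an illustration under stated assumptions, not the code from
the commit: it uses a plain HashMap and java.io streams as a stand-in for
Hadoop's MapWritable, and the class LazyMetadataSketch and all of its methods
are hypothetical. It shows the three points from the commit message: the map
is created only when needed, an existing map is cleared and reused when
reading, and the serialized form of an empty map is computed once and reused
while the metadata field is still always written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a record with optional metadata; not Nutch's HostDatum. */
class LazyMetadataSketch {

  /** Bytes of an empty metadata map, serialized once and shared by all records. */
  private static final byte[] EMPTY_METADATA = serialize(new HashMap<>());

  /** Remains null until metadata is actually set or read. */
  private Map<String, String> metaData;

  boolean hasMetaData() {
    return metaData != null && !metaData.isEmpty();
  }

  /** Creates the map on first access only. */
  Map<String, String> getMetaData() {
    if (metaData == null) {
      metaData = new HashMap<>();
    }
    return metaData;
  }

  /** The metadata field is always written, so the record layout stays unchanged. */
  byte[] toBytes() {
    return hasMetaData() ? serialize(metaData) : EMPTY_METADATA.clone();
  }

  void fromBytes(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int size = in.readInt();
      if (metaData != null) {
        metaData.clear(); // reuse the existing map instead of allocating a new one
      }
      for (int i = 0; i < size; i++) {
        String key = in.readUTF();
        String value = in.readUTF();
        getMetaData().put(key, value);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Writes the entry count followed by the key/value pairs. */
  private static byte[] serialize(Map<String, String> map) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeInt(map.size());
      for (Map.Entry<String, String> entry : map.entrySet()) {
        out.writeUTF(entry.getKey());
        out.writeUTF(entry.getValue());
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

In this sketch a record without metadata never allocates a map and never
re-serializes one, which matters most when, as the commit message notes,
empty metadata is the common case.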