This is an automated email from the ASF dual-hosted git repository.

snagel pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/nutch.git


    from 02dca3b6d NUTCH-2936 Early registration of URL stream handlers
         provided by plugins may fail Hadoop jobs running in distributed
         mode (#726)
     new 70b2d5e55 NUTCH-2950 Improve performance of UpdateHostDb
         - avoid needless conversion between host name and URL and back
           if -filter and -normalize are off
         - URLUtil: use ROOT locale when converting host name / URL
           to lowercase
     new 5a6ac3bb1 NUTCH-2950 Improve performance of UpdateHostDb
         - be lazy creating HostDatum metaData objects:
           - do not create MapWritable object unless needed
           - use clear() instead of constructing a new object
             when reading metadata from sequence file
           - use a static serialization of the empty metaData MapWritable,
             as empty HostDatum metadata is the most common case
             (stay backward compatible by keeping metadata mandatory)
     new 417dee6d1 NUTCH-2950 Improve performance of UpdateHostDb
         - simplify map function:
           - remove instanceof conditions for key (it's an instance of
             Text by method signature)
           - avoid parsing the URL string multiple times
     new 13f8504a3 Improve performance of UpdateHostDb
         - parameterize logging
         - set logging level of information which is later found in the
           HostDb itself to DEBUG (so that frequent log messages do not
           flood the log files)
         - if DNS look-ups are not enabled (no -check* options passed):
           - do not count and log the hosts skipped because they are
             not yet eligible for DNS look-ups
           - do not create DNS resolver threads
     new 5086958be NUTCH-2950 Improve performance of UpdateHostDb
         - only create the homepage string if needed,
           rely on the parsed URL to select a URL as homepage candidate
     new bafa752f7 Fail javadoc build on all kinds of javadoc errors and
         warnings independent from system settings
     new 947e67bef NUTCH-2950 Improve performance of UpdateHostDb
         - fix Javadoc errors / warnings
     new 47d3fe607 Merge pull request #731 from
         sebastian-nagel/NUTCH-2950-update-hostdb-performance

The 8 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.
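
Commit 70b2d5e55 mentions using the ROOT locale when lowercasing host
names and URLs. The sketch below illustrates why that matters in Java:
String.toLowerCase() without an explicit locale depends on the JVM default
locale, and a Turkish locale maps 'I' to a dotless i. The class and method
names here are hypothetical and are not taken from URLUtil.

```java
import java.util.Locale;

public class HostNameLowercase {

    // Hypothetical helper for illustration only; not the actual URLUtil code.
    // Locale.ROOT gives locale-independent case mapping.
    static String toLowerCaseHost(String host) {
        return host.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String host = "WIKI.EXAMPLE.ORG";

        // With a Turkish locale, 'I' is lowercased to dotless 'ı' (U+0131),
        // which yields a different (and unintended) host name.
        Locale turkish = Locale.forLanguageTag("tr-TR");
        System.out.println(host.toLowerCase(turkish));

        // With Locale.ROOT the result does not depend on system settings.
        System.out.println(toLowerCaseHost(host));
    }
}
```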


Summary of changes:
 build.xml                                          |  1 +
 src/java/org/apache/nutch/crawl/Generator.java     | 38 ++++-----
 src/java/org/apache/nutch/hostdb/HostDatum.java    | 66 +++++++++++-----
 src/java/org/apache/nutch/hostdb/ReadHostDb.java   | 29 +++----
 .../apache/nutch/hostdb/UpdateHostDbMapper.java    | 89 ++++++++++++++--------
 .../apache/nutch/hostdb/UpdateHostDbReducer.java   | 26 ++++---
 src/java/org/apache/nutch/util/URLUtil.java        | 41 +++++++++-
 7 files changed, 198 insertions(+), 92 deletions(-)
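
Commit 5a6ac3bb1 describes creating metadata objects lazily and calling
clear() instead of constructing a new object. A minimal sketch of that
general pattern follows. It uses a plain java.util.HashMap as a stand-in
for Hadoop's MapWritable, and the Record class with its methods is
hypothetical, so it should not be read as the actual HostDatum code.

```java
import java.util.HashMap;
import java.util.Map;

public class LazyMetadataSketch {

    // Hypothetical stand-in for a record with optional metadata.
    static class Record {
        // Stays null until metadata is actually needed.
        private Map<String, String> metaData;

        boolean hasMetaData() {
            return metaData != null && !metaData.isEmpty();
        }

        // Allocate the map only on first access.
        Map<String, String> getMetaData() {
            if (metaData == null) {
                metaData = new HashMap<>();
            }
            return metaData;
        }

        // Called before the object is reused for the next record:
        // empty the existing map rather than allocating a new one.
        void reset() {
            if (metaData != null) {
                metaData.clear();
            }
        }
    }

    public static void main(String[] args) {
        Record r = new Record();
        System.out.println(r.hasMetaData()); // no map allocated so far

        r.getMetaData().put("key", "value");
        System.out.println(r.hasMetaData());

        r.reset(); // map instance is kept, its content is dropped
        System.out.println(r.hasMetaData());
    }
}
```

When most records carry no metadata, this kind of approach avoids one
allocation per record in the common case.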
