[ https://issues.apache.org/jira/browse/NUTCH-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453870#comment-16453870 ]
ASF GitHub Bot commented on NUTCH-2572:
---------------------------------------

sebastian-nagel closed pull request #326: NUTCH-2572 HostDb: updatehostdb does not set values
URL: https://github.com/apache/nutch/pull/326

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java b/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
index 34a51037e..21c847db8 100644
--- a/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
+++ b/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
@@ -134,7 +134,8 @@ public void reduce(Text key, Iterable<NutchWritable> values,
 
     // Loop through all values until we find a non-empty HostDatum or use
     // an empty if this is a new host for the host db
-    for (Writable value : values) {
+    for (NutchWritable val : values) {
+      final Writable value = val.get(); // unwrap
 
       // Count crawl datum status's and collect metadata from fields
       if (value instanceof CrawlDatum) {
@@ -260,7 +261,7 @@ public void reduce(Text key, Iterable<NutchWritable> values,
       }
 
       //
-      if (value instanceof HostDatum) {
+      else if (value instanceof HostDatum) {
         HostDatum buffer = (HostDatum)value;
 
         // Check homepage URL
@@ -295,9 +296,11 @@ public void reduce(Text key, Iterable<NutchWritable> values,
       }
 
       // Check for the score
-      if (value instanceof FloatWritable) {
+      else if (value instanceof FloatWritable) {
         FloatWritable buffer = (FloatWritable)value;
         score = buffer.get();
+      } else {
+        LOG.error("Class {} not handled", value.getClass());
       }
     }

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> HostDb: updatehostdb does not set values
> ----------------------------------------
>
>                 Key: NUTCH-2572
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2572
>             Project: Nutch
>          Issue Type: Bug
>          Components: hostdb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>
> {noformat}
> % bin/nutch readdb crawl/crawldb -stats -sort
> ...
> status 1 (db_unfetched): 3
>    nutch.apache.org : 3
> status 2 (db_fetched): 2
>    nutch.apache.org : 2
> status 6 (db_notmodified): 34
>    nutch.apache.org : 34
> CrawlDb statistics: done
> % bin/nutch updatehostdb -hostdb crawl/hostdb -crawldb crawl/crawldb
> UpdateHostDb: hostdb: crawl/hostdb
> UpdateHostDb: crawldb: crawl/crawldb
> UpdateHostDb: starting at 2018-04-23 13:50:33
> UpdateHostDb: finished at 2018-04-23 13:50:35, elapsed: 00:00:01
> % bin/nutch readhostdb crawl/hostdb -get nutch.apache.org
> ReadHostDb: get: nutch.apache.org
> 0 0 0 0 0 0 0 0 0 0
> 0.0 1970-01-01 01:00:00
> {noformat}
> Although a HostDb record is added for "nutch.apache.org", all expected values
> (number of fetched/unfetched/... pages, fetch time
> min/max/average/percentiles, etc.) are empty or zero.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
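
The diff above suggests why every HostDb value stayed at zero: the loop tested the wrapper object itself with instanceof, so none of the branches matched until the wrapped value was unwrapped with get(). The following is a minimal, self-contained sketch of that pattern. It does not use Hadoop or Nutch; Writable, CountValue, FloatValue and Wrapper are hypothetical stand-ins (Wrapper plays the role of NutchWritable), and the method names are invented for illustration only.

```java
import java.util.Arrays;
import java.util.List;

public class UnwrapSketch {

  // Hypothetical stand-in for a Hadoop-style Writable marker type.
  interface Writable {}

  // Hypothetical payload types, standing in for CrawlDatum / FloatWritable.
  static final class CountValue implements Writable {
    final int count;
    CountValue(int count) { this.count = count; }
  }

  static final class FloatValue implements Writable {
    final float value;
    FloatValue(float value) { this.value = value; }
  }

  // Hypothetical wrapper standing in for NutchWritable: it is itself a
  // Writable and carries the real value, returned by get().
  static final class Wrapper implements Writable {
    private final Writable inner;
    Wrapper(Writable inner) { this.inner = inner; }
    Writable get() { return inner; }
  }

  // Pattern before the fix: instanceof is applied to the wrapper,
  // so neither branch ever matches.
  static int matchedWithoutUnwrap(List<Wrapper> values) {
    int matched = 0;
    for (Writable value : values) {
      if (value instanceof CountValue) {
        matched++;
      } else if (value instanceof FloatValue) {
        matched++;
      }
    }
    return matched;
  }

  // Pattern after the fix: unwrap first, then dispatch on the real type.
  static int matchedWithUnwrap(List<Wrapper> values) {
    int matched = 0;
    for (Wrapper val : values) {
      final Writable value = val.get(); // unwrap
      if (value instanceof CountValue) {
        matched++;
      } else if (value instanceof FloatValue) {
        matched++;
      } else {
        System.err.println("Class " + value.getClass() + " not handled");
      }
    }
    return matched;
  }

  public static void main(String[] args) {
    List<Wrapper> values = Arrays.asList(
        new Wrapper(new CountValue(3)),
        new Wrapper(new FloatValue(0.5f)));
    System.out.println(matchedWithoutUnwrap(values)); // prints 0
    System.out.println(matchedWithUnwrap(values));    // prints 2
  }
}
```

The final else branch mirrors the logging added in the patch: an unhandled class is reported instead of being skipped silently, which would make this kind of mismatch visible much earlier.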