[ 
https://issues.apache.org/jira/browse/NUTCH-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16453870#comment-16453870
 ] 

ASF GitHub Bot commented on NUTCH-2572:
---------------------------------------

sebastian-nagel closed pull request #326: NUTCH-2572 HostDb: updatehostdb does not set values
URL: https://github.com/apache/nutch/pull/326
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:


diff --git a/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java b/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
index 34a51037e..21c847db8 100644
--- a/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
+++ b/src/java/org/apache/nutch/hostdb/UpdateHostDbReducer.java
@@ -134,7 +134,8 @@ public void reduce(Text key, Iterable<NutchWritable> values,
     
     // Loop through all values until we find a non-empty HostDatum or use
     // an empty if this is a new host for the host db
-    for (Writable value : values) {
+    for (NutchWritable val : values) {
+      final Writable value = val.get(); // unwrap
       
       // Count crawl datum status's and collect metadata from fields
       if (value instanceof CrawlDatum) {
@@ -260,7 +261,7 @@ public void reduce(Text key, Iterable<NutchWritable> values,
       }
       
       // 
-      if (value instanceof HostDatum) {
+      else if (value instanceof HostDatum) {
         HostDatum buffer = (HostDatum)value;
 
         // Check homepage URL
@@ -295,9 +296,11 @@ public void reduce(Text key, Iterable<NutchWritable> values,
       }
 
       // Check for the score
-      if (value instanceof FloatWritable) {
+      else if (value instanceof FloatWritable) {
         FloatWritable buffer = (FloatWritable)value;
         score = buffer.get();
+      } else {
+        LOG.error("Class {} not handled", value.getClass());
       }
     }
 


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> HostDb: updatehostdb does not set values
> ----------------------------------------
>
>                 Key: NUTCH-2572
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2572
>             Project: Nutch
>          Issue Type: Bug
>          Components: hostdb
>    Affects Versions: 1.15
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.15
>
>
> {noformat}
> % bin/nutch readdb crawl/crawldb -stats -sort
> ...
> status 1 (db_unfetched):        3
>    nutch.apache.org :   3
> status 2 (db_fetched):  2
>    nutch.apache.org :   2
> status 6 (db_notmodified):      34
>    nutch.apache.org :   34
> CrawlDb statistics: done
> % bin/nutch updatehostdb -hostdb  crawl/hostdb -crawldb crawl/crawldb
> UpdateHostDb: hostdb: crawl/hostdb
> UpdateHostDb: crawldb: crawl/crawldb
> UpdateHostDb: starting at 2018-04-23 13:50:33
> UpdateHostDb: finished at 2018-04-23 13:50:35, elapsed: 00:00:01
> % bin/nutch readhostdb crawl/hostdb -get nutch.apache.org
> ReadHostDb: get: nutch.apache.org
> 0       0       0       0       0       0       0       0       0       0       0.0     1970-01-01 01:00:00
> {noformat}
> Although a HostDb record is added for "nutch.apache.org", all expected values 
> (number of fetched/unfetched/... pages, fetch time 
> min/max/average/percentiles, etc.) are empty or zero.
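
The all-zero record is consistent with what the diff above changes: the
reducer iterated over the NutchWritable wrappers and tested each wrapper
itself with instanceof, so none of the CrawlDatum/HostDatum/FloatWritable
branches could match. The following is a minimal sketch of that effect, not
the Nutch code. All types in it (Writable, WrapperStub, CrawlDatumStub,
HostDatumStub, ScoreStub) are hypothetical stand-ins so that the example
runs without Hadoop or Nutch on the classpath.

```java
import java.util.Arrays;
import java.util.List;

public class UnwrapSketch {
  // Hypothetical stand-ins for Hadoop's Writable and the value types
  // handled by the reducer. They are NOT the real Hadoop/Nutch classes.
  interface Writable {}
  static class CrawlDatumStub implements Writable {}
  static class HostDatumStub implements Writable {}
  static class ScoreStub implements Writable {}

  // Stand-in for NutchWritable: a Writable that wraps another Writable.
  static class WrapperStub implements Writable {
    private final Writable instance;
    WrapperStub(Writable instance) { this.instance = instance; }
    Writable get() { return instance; }
  }

  // Behaviour before the fix: the wrapper itself is type-checked,
  // so no branch ever matches and nothing is counted.
  static int countMatchesWithoutUnwrap(List<WrapperStub> values) {
    int matched = 0;
    for (Writable value : values) {
      if (value instanceof CrawlDatumStub
          || value instanceof HostDatumStub
          || value instanceof ScoreStub) {
        matched++;
      }
    }
    return matched;
  }

  // Behaviour after the fix: unwrap first, then check the wrapped type.
  static int countMatchesWithUnwrap(List<WrapperStub> values) {
    int matched = 0;
    for (WrapperStub val : values) {
      final Writable value = val.get(); // unwrap
      if (value instanceof CrawlDatumStub) {
        matched++;
      } else if (value instanceof HostDatumStub) {
        matched++;
      } else if (value instanceof ScoreStub) {
        matched++;
      }
    }
    return matched;
  }

  public static void main(String[] args) {
    List<WrapperStub> values = Arrays.asList(
        new WrapperStub(new CrawlDatumStub()),
        new WrapperStub(new HostDatumStub()),
        new WrapperStub(new ScoreStub()));
    System.out.println(countMatchesWithoutUnwrap(values)); // prints 0
    System.out.println(countMatchesWithUnwrap(values));    // prints 3
  }
}
```

The loop variable in the first method is declared with the interface type,
which is why the type checks compile but silently never succeed. The final
else branch added in the diff (logging unhandled classes) makes this kind of
mismatch visible in the logs.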



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
