Sami Siren wrote:
Doğacan Güney wrote:
Hi,

There seems to be a problem with the current Nutch svn. If you fetch with the -noParsing option and then parse the segment, all URLs end up with the same parse_text (which is the parse_text of the first URL).

In ParseSegment's map function:
Content content = (Content) value;

If you check the content after this line, it seems to be the same for all keys.

Does anyone else have this problem?

Yes, this was an unfortunate side effect of my optimization efforts. Please try the attached patch and see if it works for you.

That works just fine. Thanks!


--
 Sami Siren
------------------------------------------------------------------------

Index: src/java/org/apache/nutch/protocol/Content.java
===================================================================
--- src/java/org/apache/nutch/protocol/Content.java     (revision 475295)
+++ src/java/org/apache/nutch/protocol/Content.java     (working copy)
@@ -298,4 +298,12 @@
     return typeName;
   }
+  /**
+   * By calling this method one ensures that on next read/write to any property
+   * parent object is consulted to check if decompressing of data is required.
+   */
+  public void forceInflate() {
+    inflated = false;
+  }
+
 }
Index: src/java/org/apache/nutch/parse/ParseSegment.java
===================================================================
--- src/java/org/apache/nutch/parse/ParseSegment.java   (revision 475295)
+++ src/java/org/apache/nutch/parse/ParseSegment.java   (working copy)
@@ -66,8 +66,9 @@
       newKey.set(key.toString());
       key = newKey;
     }
-    Content content = (Content)value;
-
+    Content content = (Content) value;
+    content.forceInflate();
+
     Parse parse = null;
     ParseStatus status;
     try {
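
For anyone reading the patch later, here is a minimal sketch of the kind of bug it works around. It is not Nutch's actual Content class: LazyRecord, readFields(byte[]) and readAll are hypothetical names made up for illustration, and the only assumption is that one value object is reused for every record while a cached "inflated" flag is never reset. Under that assumption every read returns the first record's text, and resetting the flag before each use (as forceInflate() does in the patch) restores the expected values.

```java
import java.nio.charset.StandardCharsets;

public class LazyInflateDemo {

    /** Hypothetical stand-in for a lazily decoded record; not Nutch's Content. */
    static class LazyRecord {
        private byte[] raw;        // undecoded payload from the most recent read
        private String text;       // decoded value, cached after the first access
        private boolean inflated;  // true once 'text' has been decoded

        /** Simulates the framework refilling the same instance with a new record. */
        void readFields(byte[] payload) {
            raw = payload;         // 'inflated' is deliberately left untouched here
        }

        /** Decodes the payload only if the flag says it has not been done yet. */
        String getText() {
            if (!inflated) {
                text = new String(raw, StandardCharsets.UTF_8);
                inflated = true;
            }
            return text;
        }

        /** Mirrors the patch: the next access decodes the current payload again. */
        void forceInflate() {
            inflated = false;
        }
    }

    /** Reads every page through one reused record, optionally forcing a re-inflate. */
    static String[] readAll(String[] pages, boolean force) {
        LazyRecord reused = new LazyRecord();
        String[] seen = new String[pages.length];
        for (int i = 0; i < pages.length; i++) {
            reused.readFields(pages[i].getBytes(StandardCharsets.UTF_8));
            if (force) {
                reused.forceInflate();
            }
            seen[i] = reused.getText();
        }
        return seen;
    }

    public static void main(String[] args) {
        String[] pages = {"first page", "second page", "third page"};
        // Without the reset, the cached text of the first record is returned each time.
        System.out.println(String.join(" | ", readAll(pages, false)));
        // With the reset, each record is decoded from its own payload.
        System.out.println(String.join(" | ", readAll(pages, true)));
    }
}
```

In this sketch the reset could just as well live inside readFields; the patch instead exposes it as a method that the caller invokes, which is what the sketch imitates.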
