Gal Nitzan wrote:

Sorry :) no.


Hmm, OK. :) But I think that patch is needed anyway. Right now we silently assume that parse plugins will always copy all Content metadata to ParseData.metadata, which may not be the case, and it certainly does not happen if there is a parse error. That patch fixes it. Later on, Indexer tries to retrieve these values from parseData.metadata and not from content.metadata, because we try to avoid reading too much data, so the content part of a segment is not accessed during indexing.

I run fetcher with parse.

This NPE happens for only a few documents, and that is the problem :)

OK, then I think I know what is going on. Please try this patch. It is actually the same problem: these few documents failed to parse, so we got an empty parseData, which in this case also means empty metadata, and therefore neither segment name nor score in parseData.metadata.
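
To make the failure mode concrete, here is a rough standalone sketch, not the actual Nutch code. It assumes the metadata behaves like java.util.Properties; the class name, the key values, and the sample segment name are all made up for illustration. It shows one plausible way a missing score could surface as an NPE on the indexing side, and how setting both values unconditionally (as the patch does) avoids it.

```java
import java.util.Properties;

// Hypothetical stand-in for the fetcher/indexer interaction; not Nutch classes.
class ParseMetadataSketch {
    // Assumed key values, chosen only for this illustration.
    static final String SEGMENT_NAME_KEY = "segment.name";
    static final String SCORE_KEY = "score";

    // A failed parse yields metadata with nothing copied over from the content.
    static Properties emptyParseMetadata() {
        return new Properties();
    }

    // Mirrors the idea of the patch: always set both values,
    // regardless of whether the parse succeeded.
    static void addSegmentAndScore(Properties parseMeta, String segmentName, float score) {
        parseMeta.setProperty(SEGMENT_NAME_KEY, segmentName);
        parseMeta.setProperty(SCORE_KEY, Float.toString(score));
    }

    // Indexing-side read: only the parse metadata is consulted.
    // Float.parseFloat throws NullPointerException when the key is absent.
    static float readScore(Properties parseMeta) {
        return Float.parseFloat(parseMeta.getProperty(SCORE_KEY));
    }

    public static void main(String[] args) {
        Properties meta = emptyParseMetadata();
        try {
            readScore(meta);
        } catch (NullPointerException e) {
            System.out.println("without the patch: NPE while reading the score");
        }

        // The segment name below is a made-up sample value.
        addSegmentAndScore(meta, "20060108123456", 1.0f);
        System.out.println("with the patch: score=" + readScore(meta));
    }
}
```

The point of setting the values in one place after parsing, rather than relying on each parse plugin to copy them, is that the indexing side no longer depends on the parse having succeeded.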

Please test and report if it helps.

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Index: Fetcher.java
===================================================================
--- Fetcher.java        (revision 367099)
+++ Fetcher.java        (working copy)
@@ -223,6 +223,9 @@
         parse.getData().getMetadata().setProperty(SIGNATURE_KEY, StringUtil.toHexString(signature));
         datum.setSignature(signature);
       }
+      // add segment name and score to parseData metadata
+      parse.getData().getMetadata().setProperty(SEGMENT_NAME_KEY, segmentName);
+      parse.getData().getMetadata().setProperty(SCORE_KEY, Float.toString(datum.getScore()));
 
       try {
         output.collect
