[ http://issues.apache.org/jira/browse/NUTCH-406?page=comments#action_12452282 ]

Andrzej Bialecki commented on NUTCH-406:
-----------------------------------------
Erhm, -1 from me. This code checks only whether the first value is null, and if it is, discards all the other values (which may be non-null). We could therefore lose valuable data when only the first value happens to be null.

I think we should indeed check whether the first value is null, but if it is, we should then loop over all the other values, count the non-nulls, and if the count > 0 write out the <key, <non-null values>> set.

> Metadata tries to write null values
> -----------------------------------
>
>                 Key: NUTCH-406
>                 URL: http://issues.apache.org/jira/browse/NUTCH-406
>             Project: Nutch
>          Issue Type: Bug
>   Affects Versions: 0.9.0
>            Reporter: Doğacan Güney
>         Assigned To: Chris A. Mattmann
>         Attachments: NUTCH-406.patch, NUTCH-406.patch
>
>
> During parsing, some URLs (especially PDFs, it seems) may create <some_key, null> pairs in ParseData's parseMeta.
> When Metadata.write() tries to write such a pair, it causes an NPE.
> Stack trace will be something like this:
>         at org.apache.hadoop.io.Text.encode(Text.java:373)
>         at org.apache.hadoop.io.Text.encode(Text.java:354)
>         at org.apache.hadoop.io.Text.writeString(Text.java:394)
>         at org.apache.nutch.metadata.Metadata.write(Metadata.java:214)
> I can consistently reproduce this using the following url:
> http://www.efesbev.com/corporate_governance/pdf/MergerAgreement.pdf
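
To make the suggestion in the comment concrete, here is a rough, standalone sketch of the idea: collect the non-null values for each key, and write a key out only if at least one value survives. This is not the actual Metadata.write() code and not either of the attached patches. The class name NullSafeMetadataWriter, the Map<String, String[]> layout, and the use of java.io.DataOutput.writeUTF in place of Hadoop's Text.writeString are all assumptions made so the example compiles without Nutch or Hadoop on the classpath. It also filters every value rather than special-casing the first one, which should give the same result with less branching.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the null handling discussed above; not Nutch code.
final class NullSafeMetadataWriter {

    // Keeps only keys that have at least one non-null value, and only those values.
    // Keys themselves are assumed to be non-null.
    static Map<String, List<String>> filterNonNull(Map<String, String[]> metadata) {
        Map<String, List<String>> kept = new LinkedHashMap<>();
        for (Map.Entry<String, String[]> entry : metadata.entrySet()) {
            String[] values = entry.getValue();
            if (values == null) {
                continue; // no value array at all: nothing to write for this key
            }
            List<String> nonNull = new ArrayList<>();
            for (String value : values) {
                if (value != null) {
                    nonNull.add(value);
                }
            }
            // Equivalent to "count non-nulls, and write only if the count > 0".
            if (!nonNull.isEmpty()) {
                kept.put(entry.getKey(), nonNull);
            }
        }
        return kept;
    }

    // Assumed wire layout: key count, then per key: key, value count, values.
    // writeUTF is used here only as a substitute for Text.writeString.
    static void write(Map<String, String[]> metadata, DataOutput out) throws IOException {
        Map<String, List<String>> kept = filterNonNull(metadata);
        out.writeInt(kept.size());
        for (Map.Entry<String, List<String>> entry : kept.entrySet()) {
            out.writeUTF(entry.getKey());
            out.writeInt(entry.getValue().size());
            for (String value : entry.getValue()) {
                out.writeUTF(value);
            }
        }
    }

    // Convenience wrapper that serializes to a byte array.
    static byte[] toBytes(Map<String, String[]> metadata) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            write(metadata, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        Map<String, String[]> parseMeta = new LinkedHashMap<>();
        parseMeta.put("title", new String[] {null, "Merger Agreement"}); // first value null, second kept
        parseMeta.put("author", new String[] {null});                    // all values null, key dropped
        System.out.println(filterNonNull(parseMeta).keySet());
        System.out.println(toBytes(parseMeta).length + " bytes written");
    }
}
```

The point of computing the filtered set before writing anything is that the counts written to the stream always match the number of keys and values that follow, so the corresponding read side would not need to know about nulls at all.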