A couple of points:
1. You used tabs.
2. You left some unnecessary comments in the source; the bug history is
already in JIRA and the commit logs.
3. Why is there no addition to the test case?
4. The issue could have been iterated on in JIRA a bit further, so that
all of these could have been caught before the commit.
--
Sami Siren
Chris A. Mattmann (JIRA) wrote:
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ]
Chris A. Mattmann closed NUTCH-406.
-----------------------------------
Patch applied to trunk:
http://svn.apache.org/viewvc?view=rev&revision=478619
Metadata tries to write null values
-----------------------------------
Key: NUTCH-406
URL: http://issues.apache.org/jira/browse/NUTCH-406
Project: Nutch
Issue Type: Bug
Affects Versions: 0.9.0
Reporter: Doğacan Güney
Assigned To: Chris A. Mattmann
Fix For: 0.9.0
Attachments: NUTCH-406.patch, NUTCH-406.patch
During parsing, some URLs (especially PDFs, it seems) may create <some_key, null> pairs in ParseData's parseMeta.
When Metadata.write() tries to write such a pair, it throws an NPE.
The stack trace will look something like this:
at org.apache.hadoop.io.Text.encode(Text.java:373)
at org.apache.hadoop.io.Text.encode(Text.java:354)
at org.apache.hadoop.io.Text.writeString(Text.java:394)
at org.apache.nutch.metadata.Metadata.write(Metadata.java:214)
I can consistently reproduce this using the following URL:
http://www.efesbev.com/corporate_governance/pdf/MergerAgreement.pdf
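
For illustration only, here is a minimal sketch of the failure mode described above and one possible null guard. It assumes a simplified stand-in class (MetadataSketch, with hypothetical methods toBytesUnguarded and toBytesGuarded) built on java.io.DataOutputStream. It is not the Nutch Metadata class, does not use Hadoop's Text, and is not the patch that was applied. Skipping null values is just one option; writing an empty string instead would be another.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata container; NOT org.apache.nutch.metadata.Metadata.
class MetadataSketch {
    private final Map<String, String> entries = new LinkedHashMap<>();

    // Accepts null values on purpose, to mimic the <some_key, null> pairs from parsing.
    void set(String key, String value) {
        entries.put(key, value);
    }

    // Unguarded serialization: writeUTF(null) throws NullPointerException.
    byte[] toBytesUnguarded() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(entries.size());
            for (Map.Entry<String, String> e : entries.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue()); // NPE when the value is null
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    // Guarded serialization (assumed policy): skip pairs whose value is null.
    byte[] toBytesGuarded() {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            // The count must match the number of pairs actually written.
            int count = 0;
            for (String v : entries.values()) {
                if (v != null) {
                    count++;
                }
            }
            out.writeInt(count);
            for (Map.Entry<String, String> e : entries.entrySet()) {
                if (e.getValue() == null) {
                    continue;
                }
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        MetadataSketch meta = new MetadataSketch();
        meta.set("title", "Merger Agreement");
        meta.set("author", null);

        try {
            meta.toBytesUnguarded();
        } catch (NullPointerException ex) {
            System.out.println("unguarded write failed with NPE");
        }
        System.out.println("guarded write produced " + meta.toBytesGuarded().length + " bytes");
    }
}
```

A check along these lines (a null value in, no exception out) is also the kind of case that could be added to the existing test case.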