DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment
-------------------------------------------------------------------------------------------------
Key: NUTCH-525
URL: https://issues.apache.org/jira/browse/NUTCH-525
Project: Nutch
Issue Type: Bug
Affects Versions: 0.9.0
Environment: Fedora OS, JDK 1.6, Hadoop FS
Reporter: Vishal Shah
Attachments: deleteDups.patch
When trying to rerun dedup on a segment, we get the following Exception:
java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 261883
at org.apache.lucene.util.BitVector.get(BitVector.java:72)
at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
at org.apache.nutch.indexer.DeleteDuplicates1$InputFormat$DDRecordReader.next(DeleteDuplicates1.java:167)
at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
To reproduce the error, create two segments with identical URLs; fetch,
parse, index, and dedup the two segments, then rerun dedup.
The error comes from the DDRecordReader.next() method:
//skip past deleted documents
while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;
If the last document in the index is deleted, this loop increments doc past the
last document and then calls indexReader.isDeleted(doc) with doc == maxDoc, which is
out of range. Swapping the two conditions, so the bounds check runs before
isDeleted(), fixes the problem.
I've attached a patch here.
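For reference, a minimal sketch of the inverted check (the attached patch is the
authoritative change):

// Check the bounds before asking the reader, so isDeleted() is never
// called with doc == maxDoc.
while (doc < maxDoc && indexReader.isDeleted(doc)) doc++;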
On a related note, why should we skip past deleted documents at all? The only
time this can happen is when we are rerunning dedup on a segment. Since documents
are not deleted for any reason other than dedup, they should be given a chance to
compete again on a rerun, shouldn't they? We could fix this by calling
indexReader.undeleteAll() in the constructor for DDRecordReader. Any thoughts
on this?
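A rough sketch of that idea, assuming a simplified constructor (the real
DDRecordReader takes different arguments; the class and field names here are
illustrative only):

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Illustrative sketch, not the actual Nutch class: shows where the
// proposed undeleteAll() call would go.
class DDRecordReaderSketch {
  private final IndexReader indexReader;
  private int doc;
  private final int maxDoc;

  DDRecordReaderSketch(IndexReader indexReader) throws IOException {
    this.indexReader = indexReader;
    // Clear deletes left behind by a previous dedup run so every
    // document gets a chance to compete again on the rerun.
    indexReader.undeleteAll();
    this.doc = 0;
    this.maxDoc = indexReader.maxDoc();
  }
}

Note that undeleteAll() needs write access to the index and can throw
IOException, so the surrounding error handling would have to allow for that.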