DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment
-------------------------------------------------------------------------------------------------
                 Key: NUTCH-525
                 URL: https://issues.apache.org/jira/browse/NUTCH-525
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.9.0
         Environment: Fedora OS, JDK 1.6, Hadoop FS
            Reporter: Vishal Shah
         Attachments: deleteDups.patch

When trying to rerun dedup on a segment, we get the following exception:

java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 261883
    at org.apache.lucene.util.BitVector.get(BitVector.java:72)
    at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
    at org.apache.nutch.indexer.DeleteDuplicates1$InputFormat$DDRecordReader.next(DeleteDuplicates1.java:167)
    at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)

To reproduce the error, create two segments with identical URLs: fetch, parse, index and dedup the two segments. Then rerun dedup.

The error comes from the DDRecordReader.next() method:

    // skip past deleted documents
    while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;

If the last document in the index is deleted, this loop skips past the last document and calls indexReader.isDeleted(doc) again with doc equal to maxDoc, which is out of range. The order of the two conditions should be swapped to fix the problem, so that doc < maxDoc is checked before isDeleted(doc) is called. I've attached a patch here.

On a related note, why should we skip past deleted documents at all? The only time this happens is when we are rerunning dedup on a segment. If documents are never deleted for any reason other than dedup, shouldn't they be given a chance to compete again? We could do this by calling indexReader.undeleteAll() in the constructor of DDRecordReader.

Any thoughts on this?
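
To illustrate the ordering problem outside of Nutch, here is a minimal, self-contained sketch. It is not the attached patch: the class name SkipDeleted, the helper methods and the boolean[] standing in for the index's deletion bits are all hypothetical, chosen only so the example runs without Lucene or Hadoop. It relies on Java's && evaluating its right-hand operand only when the left-hand one is true.

```java
public class SkipDeleted {
    // Hypothetical stand-in for indexReader.isDeleted(doc): an array lookup
    // that throws ArrayIndexOutOfBoundsException when doc is out of range.
    static boolean isDeleted(boolean[] deleted, int doc) {
        return deleted[doc];
    }

    // Order used in the report: isDeleted(doc) is evaluated before the
    // bounds check, so it is eventually called with doc == maxDoc.
    static int skipOriginalOrder(boolean[] deleted, int doc) {
        int maxDoc = deleted.length;
        while (isDeleted(deleted, doc) && doc < maxDoc) doc++;
        return doc;
    }

    // Swapped order: && short-circuits, so isDeleted(doc) is never
    // evaluated once doc has reached maxDoc.
    static int skipSwappedOrder(boolean[] deleted, int doc) {
        int maxDoc = deleted.length;
        while (doc < maxDoc && isDeleted(deleted, doc)) doc++;
        return doc;
    }

    public static void main(String[] args) {
        // Three documents; the last two (including the final one) are deleted.
        boolean[] deleted = {false, true, true};

        // Swapped order stops cleanly at maxDoc.
        System.out.println("swapped order stops at " + skipSwappedOrder(deleted, 1));

        // Original order reads one slot past the end.
        try {
            skipOriginalOrder(deleted, 1);
            System.out.println("original order did not throw");
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("original order throws ArrayIndexOutOfBoundsException");
        }
    }
}
```

With the swapped order, a return value equal to maxDoc means there are no more live documents, so the caller still has to check for that case before reading a document.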
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers