[ https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514914 ]
Andrzej Bialecki commented on NUTCH-525:
-----------------------------------------

+1 for adding undeleteAll(). When DDRecordReader was created, this call was omitted to account for possible existing deletions resulting from a run of PruneIndexTool; however, this seems to create more problems than it's worth. In any case, it's better to run PruneIndexTool after deduplication and merging.

> DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment
> -------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-525
>                 URL: https://issues.apache.org/jira/browse/NUTCH-525
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>        Environment: Fedora OS, JDK 1.6, Hadoop FS
>            Reporter: Vishal Shah
>         Attachments: deleteDups.patch
>
>
> When trying to rerun dedup on a segment, we get the following exception:
>
> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 261883
>         at org.apache.lucene.util.BitVector.get(BitVector.java:72)
>         at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
>         at org.apache.nutch.indexer.DeleteDuplicates1$InputFormat$DDRecordReader.next(DeleteDuplicates1.java:167)
>         at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>         at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
>
> To reproduce the error, create two segments with identical URLs, then fetch, parse, index, and dedup the two segments. Then rerun dedup.
> The error comes from the DDRecordReader.next() method:
>
>     // skip past deleted documents
>     while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;
>
> If the last document in the index is deleted, this loop increments doc past the last document and then calls indexReader.isDeleted(doc) once more with doc == maxDoc, which is out of range.
> The two conditions should be swapped so that the bounds check is evaluated first.
> I've attached a patch here.
> On a related note, why should we skip past deleted documents at all? The only time this can happen is when we are rerunning dedup on a segment. If documents were deleted for any reason other than dedup, they should be given a chance to compete again, shouldn't they? We could fix this by putting an indexReader.undeleteAll() call in the constructor of DDRecordReader. Any thoughts on this?

--
This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
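[Editor's note] The ordering bug quoted above can be reproduced in isolation. The following is a minimal sketch, not Nutch code: isDeleted(), maxDoc, and the boolean[] backing store are stand-ins for Lucene's SegmentReader, chosen only to show why Java's short-circuit && makes the order of the two conditions matter.

```java
// Minimal stand-in for the deleted-document check in DDRecordReader.next().
// The last document (index maxDoc - 1) is marked deleted, which is the
// scenario that triggers the reported ArrayIndexOutOfBoundsException.
public class SkipDeletedDemo {
    static final boolean[] deleted = {false, false, true}; // last doc deleted
    static final int maxDoc = deleted.length;

    // Stand-in for SegmentReader.isDeleted(); like Lucene's BitVector.get(),
    // it throws ArrayIndexOutOfBoundsException for doc >= maxDoc.
    static boolean isDeleted(int doc) {
        return deleted[doc];
    }

    // Buggy order (as in DeleteDuplicates1): isDeleted(doc) is evaluated
    // before the bounds check, so doc == maxDoc reaches the array access.
    static int skipBuggy(int doc) {
        while (isDeleted(doc) && doc < maxDoc) doc++;
        return doc;
    }

    // Fixed order (as proposed in the patch): doc < maxDoc short-circuits
    // the && so isDeleted() is never called with an out-of-range index.
    static int skipFixed(int doc) {
        while (doc < maxDoc && isDeleted(doc)) doc++;
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(skipFixed(2)); // prints 3: stops at maxDoc, no exception
        try {
            skipBuggy(2);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("buggy order threw AIOOBE"); // reproduces the bug
        }
    }
}
```

With the operands swapped, doc == maxDoc makes the left operand false and the loop exits before the array is ever touched; the caller then sees doc == maxDoc and can treat the input as exhausted.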