DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment
-------------------------------------------------------------------------------------------------

                 Key: NUTCH-525
                 URL: https://issues.apache.org/jira/browse/NUTCH-525
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 0.9.0
         Environment: Fedora OS, JDK 1.6, Hadoop FS
            Reporter: Vishal Shah
         Attachments: deleteDups.patch

When trying to rerun dedup on a segment, we get the following exception:

java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 261883
        at org.apache.lucene.util.BitVector.get(BitVector.java:72)
        at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
        at org.apache.nutch.indexer.DeleteDuplicates1$InputFormat$DDRecordReader.next(DeleteDuplicates1.java:167)
        at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)

To reproduce the error, create two segments with identical URLs: fetch, parse, index, and dedup the two segments, then rerun dedup.

The error comes from the DDRecordReader.next() method:

//skip past deleted documents
while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;

If the last document in the index is deleted, this loop increments doc past the last document and calls indexReader.isDeleted(doc) with doc == maxDoc, which throws the exception above.

Swapping the order of the two conditions, so that the bounds check is evaluated first and short-circuits the isDeleted() call, fixes the problem.

I've attached a patch here.
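
For reference, the corrected loop presumably reads as follows (a minimal sketch; the attached patch is authoritative). With the operands swapped, && short-circuits and isDeleted() is never called once doc reaches maxDoc:

// check the bound first so isDeleted() is never called with doc == maxDoc
while (doc < maxDoc && indexReader.isDeleted(doc)) doc++;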


On a related note, why should we skip past deleted documents at all? The only time this will happen is when we are rerunning dedup on a segment. Since documents are only ever deleted by dedup itself, shouldn't they be given a chance to compete again on a rerun? We could do this by calling indexReader.undeleteAll() in the constructor for DDRecordReader, as sketched below. Any thoughts on this?
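
Concretely, the undeleteAll() idea would amount to something like the following sketch. The class and constructor shape here are assumptions for illustration, not the actual Nutch code; IndexReader.undeleteAll() is the real Lucene call that resurrects every document flagged as deleted in the index:

import java.io.IOException;
import org.apache.lucene.index.IndexReader;

// Sketch only: clear deletions left over from a previous dedup run
// so that every document competes again on the rerun.
public class DDRecordReaderSketch {
  private final IndexReader indexReader;

  public DDRecordReaderSketch(String indexPath) throws IOException {
    indexReader = IndexReader.open(indexPath);  // open the segment's index
    indexReader.undeleteAll();                  // undo all prior deletions
  }
}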

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

