[ https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514915 ]

Doğacan Güney commented on NUTCH-525:
-------------------------------------

OK, I can see why undelete is useful. But I still think we should make it 
optional (via a command-line parameter). There may be people out there who 
delete documents without using PruneIndexTool. It would be extremely weird for 
them if documents they had just deleted suddenly 'popped up' in their search 
results.
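
For illustration, gating it behind a switch could look roughly like this (a 
sketch only; the -undelete flag and the wiring around it are made up, not an 
existing Nutch option):

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;

    // Sketch: restore deleted docs before dedup only when explicitly asked.
    // The "-undelete" switch is hypothetical.
    public class DedupUndeleteSketch {
      public static void main(String[] args) throws IOException {
        boolean undelete = false;
        String indexDir = args[args.length - 1];        // segment index path
        for (String arg : args) {
          if ("-undelete".equals(arg)) undelete = true; // opt-in, off by default
        }
        IndexReader reader = IndexReader.open(indexDir);
        try {
          if (undelete) reader.undeleteAll();           // clear delete marks
        } finally {
          reader.close();
        }
      }
    }

That way the default behavior stays exactly as it is today.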

> DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to rerun dedup on a segment
> -------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-525
>                 URL: https://issues.apache.org/jira/browse/NUTCH-525
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>         Environment: Fedora OS, JDK 1.6, Hadoop FS
>            Reporter: Vishal Shah
>         Attachments: deleteDups.patch
>
>
> When trying to rerun dedup on a segment, we get the following Exception:
> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 261883
>       at org.apache.lucene.util.BitVector.get(BitVector.java:72)
>       at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
>       at org.apache.nutch.indexer.DeleteDuplicates1$InputFormat$DDRecordReader.next(DeleteDuplicates1.java:167)
>       at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
> To reproduce the error, create two segments with identical URLs; fetch, 
> parse, index and dedup the two segments, then rerun dedup.
> The error comes from the DDRecordReader.next() method:
> //skip past deleted documents
> while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;
> If the last document in the index is deleted, this loop will increment doc 
> past the last document and call indexReader.isDeleted(doc) with an 
> out-of-range index. Swapping the order of the two conditions, so the bounds 
> check runs first and short-circuits the isDeleted() call, fixes the problem.
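> With the order swapped, the corrected loop reads:
> // skip past deleted documents (check the bound first)
> while (doc < maxDoc && indexReader.isDeleted(doc)) doc++;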
> I've attached a patch here.
> On a related note, why should we skip past deleted documents at all? The 
> only time this will happen is when we are rerunning dedup on a segment. If 
> the documents were deleted by a previous dedup run and for no other reason, 
> shouldn't they be given a chance to compete again? We could fix this by 
> calling indexReader.undeleteAll() in the constructor of DDRecordReader. Any 
> thoughts on this?
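> In code, that suggestion amounts to one extra call where the reader is 
> opened (a sketch only; the rest of the constructor is unchanged):
> // in the DDRecordReader constructor, right after opening indexReader:
> indexReader.undeleteAll(); // clear all delete marks so docs can compete again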
