[ https://issues.apache.org/jira/browse/NUTCH-525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12514903 ]

Doğacan Güney commented on NUTCH-525:
-------------------------------------

Nice patch. Could you also add a unit test? It would be enough to add a test that 
adds the same URL to two segments and then tries to dedup them (like what you are 
doing to reproduce the bug).
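
Something along these lines could serve as a starting point. This is only a 
sketch under assumptions: the stored field names ("url", "digest", "boost", 
"tstamp"), the temp-directory layout, and the JUnit 3 style are guesses at what 
DDRecordReader expects and at the project's test conventions, and would need to 
be adjusted to the actual indexing schema.

import junit.framework.TestCase;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.nutch.indexer.DeleteDuplicates;
import org.apache.nutch.util.NutchConfiguration;

public class TestDeleteDuplicates extends TestCase {

  // Field names here are assumptions about what DDRecordReader reads;
  // adjust them to the fields the Indexer actually stores.
  private Path createIndex(String name, String url) throws Exception {
    Path index = new Path(System.getProperty("java.io.tmpdir"), name);
    IndexWriter writer =
        new IndexWriter(index.toString(), new WhitespaceAnalyzer(), true);
    Document doc = new Document();
    doc.add(new Field("url", url, Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("digest", "samedigest",
        Field.Store.YES, Field.Index.UN_TOKENIZED));
    doc.add(new Field("boost", "1.0", Field.Store.YES, Field.Index.NO));
    doc.add(new Field("tstamp",
        DateTools.timeToString(System.currentTimeMillis(),
            DateTools.Resolution.MILLISECOND),
        Field.Store.YES, Field.Index.NO));
    writer.addDocument(doc);
    writer.optimize();
    writer.close();
    return index;
  }

  public void testRerunDedup() throws Exception {
    Configuration conf = NutchConfiguration.create();
    // two indexes containing the same url (and the same content digest)
    Path[] indexes = new Path[] {
        createIndex("dedup-test-index1", "http://www.example.com/"),
        createIndex("dedup-test-index2", "http://www.example.com/") };

    DeleteDuplicates dedup = new DeleteDuplicates(conf);
    dedup.dedup(indexes); // first run deletes one of the two copies
    dedup.dedup(indexes); // rerun used to throw ArrayIndexOutOfBoundsException

    // exactly one live copy of the url should survive both runs
    int live = 0;
    for (int i = 0; i < indexes.length; i++) {
      IndexReader reader = IndexReader.open(indexes[i].toString());
      for (int d = 0; d < reader.maxDoc(); d++) {
        if (!reader.isDeleted(d)) live++;
      }
      reader.close();
    }
    assertEquals(1, live);
  }
}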

> On a related note, why should we skip past deleted documents? The only time 
> when this will happen is when we are rerunning 
> dedup on a segment. If documents are not deleted for any reason other than 
> dedup, then they should be given a chance to 
> compete again, shouldn't they? We could fix this by putting an 
> indexReader.undeleteAll() in the constructor for DDRecordReader. 
> Any thoughts on this?

Why would we want deleted documents to compete again? They are deleted from the 
index because we either have a more recent version of the same page, or another 
more important page contains the same content. So, since we already have the 
'better' page in our index, I don't see how undelete helps...

> DeleteDuplicates generates ArrayIndexOutOfBoundsException when trying to 
> rerun dedup on a segment
> -------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-525
>                 URL: https://issues.apache.org/jira/browse/NUTCH-525
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.9.0
>         Environment: Fedora OS, JDK 1.6, Hadoop FS
>            Reporter: Vishal Shah
>         Attachments: deleteDups.patch
>
>
> When trying to rerun dedup on a segment, we get the following Exception:
> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 261883
>       at org.apache.lucene.util.BitVector.get(BitVector.java:72)
>       at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
>       at org.apache.nutch.indexer.DeleteDuplicates1$InputFormat$DDRecordReader.next(DeleteDuplicates1.java:167)
>       at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1445)
> To reproduce the error, try creating two segments with identical URLs - 
> fetch, parse, index, and dedup the two segments. Then rerun dedup.
> The error comes from the DDRecordReader.next() method:
> //skip past deleted documents
> while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;
> If the last document in the index is deleted, this loop increments doc past 
> the last document and calls indexReader.isDeleted(doc) with doc == maxDoc, 
> which is out of range.
> The conditions should be inverted to fix the problem (see the before/after 
> sketch below the quoted report).
> I've attached a patch here.
> On a related note, why should we skip past deleted documents? The only time 
> when this will happen is when we are rerunning dedup on a segment. If 
> documents are not deleted for any reason other than dedup, then they should 
> be given a chance to compete again, shouldn't they? We could fix this by putting an 
> indexReader.undeleteAll() in the constructor for DDRecordReader. Any thoughts 
> on this?
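
For reference, the inversion described in the report above amounts to checking 
the bound before calling isDeleted(), so that the short-circuiting && never 
evaluates isDeleted(maxDoc). A minimal before/after sketch (variable names as 
in DDRecordReader.next()):

// buggy: isDeleted(doc) is evaluated before the bound check, so when the
// last document is deleted, doc is incremented to maxDoc and
// indexReader.isDeleted(maxDoc) throws ArrayIndexOutOfBoundsException
while (indexReader.isDeleted(doc) && doc < maxDoc) doc++;

// fixed: check the bound first; && short-circuits, so isDeleted() is
// never called with doc == maxDoc
while (doc < maxDoc && indexReader.isDeleted(doc)) doc++;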

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

