[ 
https://issues.apache.org/jira/browse/NUTCH-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700353#comment-13700353
 ] 

Sebastian Nagel commented on NUTCH-1520:
----------------------------------------

Hi [~markus17], you are right. If we assume that the segments to be merged 
have already been used to update the CrawlDb, there is no need to keep linked 
CrawlDatums and notmodified fetch datums. I've tested the patch with the test 
data from NUTCH-1113: the index is identical whether it is built from the 
original segments or from the single merged segment.
+1 to commit, and to resolve NUTCH-1113 as well.
                
> SegmentMerger looses records
> ----------------------------
>
>                 Key: NUTCH-1520
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1520
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.9
>
>         Attachments: NUTCH-1520-1.7-1.patch
>
>
> It seems the SegmentMerger tool loses documents. You're likely to see fewer 
> documents in an index if you index one or more already merged segments than 
> if you index all unmerged segments.
> This is really nasty!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
