[ https://issues.apache.org/jira/browse/NUTCH-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13700353#comment-13700353 ]

Sebastian Nagel commented on NUTCH-1520:
----------------------------------------

Hi [~markus17],

you are right. If we assume that the segments to be merged have already been used to update the CrawlDb, there is no need to keep linked CrawlDatums and notmodified fetch datums.

I've tested the patch with the test data from NUTCH-1113: the index is identical whether it is built from the original segments or from the single merged segment.

+1 to commit, and to resolve NUTCH-1113 as well.

> SegmentMerger looses records
> ----------------------------
>
>                 Key: NUTCH-1520
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1520
>             Project: Nutch
>          Issue Type: Bug
>   Affects Versions: 1.6
>            Reporter: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.9
>
>         Attachments: NUTCH-1520-1.7-1.patch
>
>
> It seems the SegmentMerger tool loses documents. You're likely to see fewer
> documents in an index if you index one or more already merged segments than
> if you index all unmerged segments.
> This is really nasty!