[
https://issues.apache.org/jira/browse/NUTCH-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556059#comment-13556059
]
Markus Jelsma commented on NUTCH-1520:
--------------------------------------
The problem has its root in SegmentMerger preferring CrawlDatums from newer
segments without looking at the fetch status. The problem is easily reproduced:
1. fetch url_A
2. fetch url_B
3. fetch url_A
url_A redirects to url_B. In the second segment url_B has a FETCH_SUCCESS
status which is why it's being indexed. However, url_B has a LINKED status in
the first and third segment. Because the CrawlDatum in the third segment
prevails we end up with just a LINKED status for url_B, which is why it's not
getting indexed.
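To make the scenario concrete, here is a minimal Java sketch of a "newest segment wins" merge. It is not SegmentMerger's actual code: the Entry class, the two-value Status enum, and the integer segment numbers are hypothetical stand-ins for Nutch's CrawlDatum, its status constants, and segment names. Only the entries for url_B are modelled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NewestWinsMergeSketch {
    // Simplified stand-in for CrawlDatum statuses (assumption, not the Nutch constants).
    enum Status { LINKED, FETCH_SUCCESS }

    // Hypothetical record of one URL's entry in one segment.
    static final class Entry {
        final String url;
        final int segment;   // higher number = newer segment
        final Status status;

        Entry(String url, int segment, Status status) {
            this.url = url;
            this.segment = segment;
            this.status = status;
        }
    }

    // Keep, per URL, the entry from the newest segment and ignore its status.
    static Map<String, Entry> mergeNewestWins(List<Entry> entries) {
        Map<String, Entry> merged = new LinkedHashMap<>();
        for (Entry e : entries) {
            Entry current = merged.get(e.url);
            if (current == null || e.segment > current.segment) {
                merged.put(e.url, e);
            }
        }
        return merged;
    }

    // The three-segment scenario from the comment, restricted to url_B.
    static List<Entry> scenario() {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("url_B", 1, Status.LINKED));        // url_A fetched, redirects to url_B
        entries.add(new Entry("url_B", 2, Status.FETCH_SUCCESS)); // url_B fetched
        entries.add(new Entry("url_B", 3, Status.LINKED));        // url_A fetched again
        return entries;
    }

    public static void main(String[] args) {
        Entry survivor = mergeNewestWins(scenario()).get("url_B");
        System.out.println(survivor.segment + " " + survivor.status); // prints "3 LINKED"
    }
}
```

The FETCH_SUCCESS entry from segment 2 is discarded purely because segment 3 is newer, which matches the lost-document symptom.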
Any ideas on how to proceed? Should we prefer CrawlDatums with a fetch status
even if they're older?
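One way to read the question is sketched below in Java: an entry with a fetch status beats one without, and segment age only breaks ties between entries of the same kind. This is an illustration of the idea under discussion, not a fix that was adopted. Entry, Status, isFetchStatus, and shouldReplace are hypothetical names, and the real CrawlDatum has more statuses than the two modelled here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PreferFetchedMergeSketch {
    // Simplified stand-in for CrawlDatum statuses (assumption, not the Nutch constants).
    enum Status { LINKED, FETCH_SUCCESS }

    // Hypothetical record of one URL's entry in one segment.
    static final class Entry {
        final String url;
        final int segment;   // higher number = newer segment
        final Status status;

        Entry(String url, int segment, Status status) {
            this.url = url;
            this.segment = segment;
            this.status = status;
        }
    }

    // In this sketch only FETCH_SUCCESS counts as a fetch status.
    static boolean isFetchStatus(Status status) {
        return status == Status.FETCH_SUCCESS;
    }

    // A fetched entry beats an unfetched one; otherwise the newer segment wins.
    static boolean shouldReplace(Entry current, Entry candidate) {
        if (current == null) {
            return true;
        }
        boolean currentFetched = isFetchStatus(current.status);
        boolean candidateFetched = isFetchStatus(candidate.status);
        if (currentFetched != candidateFetched) {
            return candidateFetched;
        }
        return candidate.segment > current.segment;
    }

    static Map<String, Entry> mergePreferFetched(List<Entry> entries) {
        Map<String, Entry> merged = new LinkedHashMap<>();
        for (Entry e : entries) {
            if (shouldReplace(merged.get(e.url), e)) {
                merged.put(e.url, e);
            }
        }
        return merged;
    }

    // The three-segment scenario from the comment, restricted to url_B.
    static List<Entry> scenario() {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("url_B", 1, Status.LINKED));
        entries.add(new Entry("url_B", 2, Status.FETCH_SUCCESS));
        entries.add(new Entry("url_B", 3, Status.LINKED));
        return entries;
    }

    public static void main(String[] args) {
        Entry kept = mergePreferFetched(scenario()).get("url_B");
        System.out.println(kept.segment + " " + kept.status); // prints "2 FETCH_SUCCESS"
    }
}
```

With this rule the FETCH_SUCCESS entry from the second segment survives the merge. The trade-off is that a stale fetch could outlive newer information about the same URL, so a real fix would need to decide how other statuses are ranked.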
> SegmentMerger looses records
> ----------------------------
>
> Key: NUTCH-1520
> URL: https://issues.apache.org/jira/browse/NUTCH-1520
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.6
> Reporter: Markus Jelsma
> Priority: Critical
> Fix For: 1.7
>
>
> It seems the SegmentMerger tool loses documents. You're likely to see fewer
> documents in an index if you index one or more already merged segments than
> if you index all unmerged segments.
> This is really nasty!