[ 
https://issues.apache.org/jira/browse/NUTCH-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556059#comment-13556059
 ] 

Markus Jelsma commented on NUTCH-1520:
--------------------------------------

The problem has its root in SegmentMerger preferring CrawlDatums from newer 
segments without looking at the fetch status. The problem is easily reproduced:

1. fetch url_A (first segment)
2. fetch url_B (second segment)
3. fetch url_A (third segment)

url_A redirects to url_B. In the second segment url_B has a FETCH_SUCCESS 
status, which is why it gets indexed. However, url_B only has a LINKED status 
in the first and third segments. Because the CrawlDatum in the third segment 
prevails in the merge, we end up with just a LINKED status for url_B, which is 
why it's not getting indexed.
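
To make the scenario concrete, here is a minimal sketch of a "newest segment 
wins" merge applied to the three url_B records. It is not SegmentMerger's 
actual code: the Datum class, the plain string statuses and the integer 
segment numbers are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class NewestWinsDemo {
    // Hypothetical stand-in for a CrawlDatum; not the Nutch class.
    public static final class Datum {
        public final String url;
        public final String status;  // "FETCH_SUCCESS" or "LINKED"
        public final int segment;    // assumption: higher number = newer segment

        public Datum(String url, String status, int segment) {
            this.url = url;
            this.status = status;
            this.segment = segment;
        }
    }

    // Assumed merge rule: keep the datum from the newest segment,
    // without looking at the status at all.
    public static Datum mergeNewestWins(List<Datum> data) {
        Datum winner = null;
        for (Datum d : data) {
            if (winner == null || d.segment > winner.segment) {
                winner = d;
            }
        }
        return winner;
    }

    // The three records for url_B from the reproduction steps.
    public static List<Datum> urlBAcrossSegments() {
        List<Datum> data = new ArrayList<>();
        data.add(new Datum("url_B", "LINKED", 1));        // url_A redirected to url_B
        data.add(new Datum("url_B", "FETCH_SUCCESS", 2)); // url_B itself was fetched
        data.add(new Datum("url_B", "LINKED", 3));        // url_A redirected to url_B again
        return data;
    }

    public static void main(String[] args) {
        Datum merged = mergeNewestWins(urlBAcrossSegments());
        // Prints: url_B -> LINKED (segment 3)
        System.out.println(merged.url + " -> " + merged.status
                + " (segment " + merged.segment + ")");
    }
}
```

The FETCH_SUCCESS record from the second segment is dropped, so an indexer 
that only picks up fetched documents would skip url_B after the merge.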

Any ideas on how to proceed? Should we prefer CrawlDatums with a fetch status 
even if they are older?
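
One possible shape for that rule, as a rough sketch only: a datum with a fetch 
status beats one without, and segment age only breaks ties. The Datum class, 
the "FETCH_" prefix check and the integer segment numbers are assumptions for 
illustration, not Nutch's API or an agreed fix.

```java
import java.util.Arrays;
import java.util.List;

public class PreferFetchedDemo {
    // Hypothetical stand-in for a CrawlDatum; not the Nutch class.
    public static final class Datum {
        public final String url;
        public final String status;
        public final int segment;    // assumption: higher number = newer segment

        public Datum(String url, String status, int segment) {
            this.url = url;
            this.status = status;
            this.segment = segment;
        }
    }

    // Assumption: any status starting with "FETCH_" counts as a fetch status.
    public static boolean hasFetchStatus(Datum d) {
        return d.status.startsWith("FETCH_");
    }

    // Prefer a datum with a fetch status; among equals, prefer the newer segment.
    public static Datum mergePreferFetched(List<Datum> data) {
        Datum winner = null;
        for (Datum d : data) {
            if (winner == null) {
                winner = d;
                continue;
            }
            boolean candidateFetched = hasFetchStatus(d);
            boolean winnerFetched = hasFetchStatus(winner);
            if (candidateFetched != winnerFetched) {
                if (candidateFetched) {
                    winner = d;      // fetched beats linked, even if older
                }
            } else if (d.segment > winner.segment) {
                winner = d;          // same kind of status: newest wins
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        List<Datum> urlB = Arrays.asList(
                new Datum("url_B", "LINKED", 1),
                new Datum("url_B", "FETCH_SUCCESS", 2),
                new Datum("url_B", "LINKED", 3));
        Datum merged = mergePreferFetched(urlB);
        // Prints: url_B -> FETCH_SUCCESS (segment 2)
        System.out.println(merged.url + " -> " + merged.status
                + " (segment " + merged.segment + ")");
    }
}
```

With this rule url_B keeps its FETCH_SUCCESS record from the second segment. 
The trade-off is that an older fetch result can outlive newer link 
information, which is exactly the question raised above.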
                
> SegmentMerger looses records
> ----------------------------
>
>                 Key: NUTCH-1520
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1520
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.7
>
>
> It seems the SegmentMerger tool loses documents. You're likely to see fewer 
> documents in an index if you index one or more already merged segments than 
> if you index all unmerged segments.
> This is really nasty!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
