[jira] [Commented] (NUTCH-1520) SegmentMerger looses records

Markus Jelsma (JIRA) Thu, 17 Jan 2013 04:18:31 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13556111#comment-13556111 ]


Markus Jelsma commented on NUTCH-1520:
--------------------------------------

Ah, I completely forgot about that issue!

{quote}
I think this assumption is ok but it is necessary to preserve more than one
(the latest) CrawlDatum:

    1. at least the latest out of {FETCH_SUCCESS, FETCH_GONE, FETCH_RETRY,
       FETCH_REDIR*}
    2. eventually the latest of FETCH_NOTMODIFIED (when re-indexing all
       segments IndexerMapReduce does not index documents with only a
       FETCH_NOTMODIFIED)
    3. possibly all linked CrawlDatums in crawl_fetch of the latest segment
       (similarly to those in crawl_parse)
{quote}

Aren't 1 and 2 the same if we keep the latest datum with hasFetchStatus()
anyway? And what would we need the linked CrawlDatums for? Rebuilding the
crawldb?
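
To make the question about points 1 and 2 concrete, here is a minimal Python
sketch. It is not Nutch code: the Datum class, the string status names and both
selection functions are hypothetical stand-ins, and FETCH_REDIR* is assumed to
expand to a temporary and a permanent redirect status. It only compares "keep
the single latest datum that has any fetch status" with "keep the latest datum
per group" as listed in the quoted comment.

```python
from dataclasses import dataclass

# Illustrative status names taken from the quoted comment. These are plain
# strings, not Nutch's CrawlDatum constants.
FETCH_GROUP = {"FETCH_SUCCESS", "FETCH_GONE", "FETCH_RETRY",
               "FETCH_REDIR_TEMP", "FETCH_REDIR_PERM"}
NOTMODIFIED = "FETCH_NOTMODIFIED"


@dataclass(frozen=True)
class Datum:
    """Hypothetical stand-in for one CrawlDatum of a single URL."""
    status: str
    fetch_time: int  # larger value means more recent


def has_fetch_status(datum):
    """Assumed analogue of hasFetchStatus(): true for any fetch status."""
    return datum.status in FETCH_GROUP or datum.status == NOTMODIFIED


def latest(datums):
    """Return the most recent datum, or None if the list is empty."""
    return max(datums, key=lambda d: d.fetch_time, default=None)


def keep_single_latest(datums):
    """Keep only the latest datum that has any fetch status."""
    newest = latest([d for d in datums if has_fetch_status(d)])
    return [newest] if newest else []


def keep_per_group(datums):
    """Keep the latest datum of point 1 and the latest of point 2."""
    kept = []
    groups = ([d for d in datums if d.status in FETCH_GROUP],
              [d for d in datums if d.status == NOTMODIFIED])
    for group in groups:
        newest = latest(group)
        if newest:
            kept.append(newest)
    return kept


# One URL fetched successfully in an older segment, then reported as
# not modified in a newer one.
history = [Datum("FETCH_SUCCESS", 100), Datum(NOTMODIFIED, 200)]
single = keep_single_latest(history)
grouped = keep_per_group(history)
```

For this sample history, `single` holds only the FETCH_NOTMODIFIED datum, while
`grouped` also retains the older FETCH_SUCCESS one. Whether that difference
matters for the merged segment depends on how the indexer treats a URL that is
left with only a FETCH_NOTMODIFIED, as the quoted point 2 describes.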


                
> SegmentMerger looses records
> ----------------------------
>
>                 Key: NUTCH-1520
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1520
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.6
>            Reporter: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.7
>
>         Attachments: NUTCH-1520-1.7-1.patch
>
>
> It seems the SegmentMerger tool loses documents. You're likely to see fewer
> documents in an index if you index one or more already merged segments than
> if you index all unmerged segments.
> This is really nasty!

