[ 
https://issues.apache.org/jira/browse/NUTCH-1616?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13718283#comment-13718283
 ] 

Markus Jelsma commented on NUTCH-1616:
--------------------------------------

Hi Sebastian!

You're right about the sorting and that it can produce strange results, but 
if you ask me the output should be identical for identical input, over and 
over again. The sorting issue doesn't account for the difference in reduce 
input records, right?

Nevertheless, I'll see if fetchTime can produce a suitable hint for multiple 
segment inputs.
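
A rough sketch of what such a hint could look like, only to illustrate the 
idea. Datum, Status, rank and pick are hypothetical stand-ins and not the 
actual CrawlDatum or SegmentMerger code. The assumption is that a fetch 
status should always win over linked, and that fetchTime (then the segment 
name) breaks the remaining ties, so the pick no longer depends on the order 
in which the values arrive:

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class FetchDatumPicker {
    // Hypothetical stand-in for the status of a crawl_fetch entry.
    enum Status { LINKED, FETCH_SUCCESS, NOT_MODIFIED }

    // Hypothetical stand-in for a crawl_fetch entry; not Nutch's CrawlDatum.
    static final class Datum {
        final Status status;
        final long fetchTime;
        final String segment;

        Datum(Status status, long fetchTime, String segment) {
            this.status = status;
            this.fetchTime = fetchTime;
            this.segment = segment;
        }
    }

    // Assumption: linked ranks lowest, any fetch status ranks above it.
    static int rank(Status s) {
        return s == Status.LINKED ? 0 : 1;
    }

    // Total order: status rank first, then fetchTime, then segment name.
    static final Comparator<Datum> ORDER =
        Comparator.comparingInt((Datum d) -> rank(d.status))
                  .thenComparingLong(d -> d.fetchTime)
                  .thenComparing(d -> d.segment);

    // Returns the preferred datum regardless of input order.
    // Throws NoSuchElementException if the list is empty.
    static Datum pick(List<Datum> candidates) {
        return Collections.max(candidates, ORDER);
    }

    public static void main(String[] args) {
        Datum linked = new Datum(Status.LINKED, 200L, "seg2");
        Datum fetched = new Datum(Status.FETCH_SUCCESS, 100L, "seg1");
        // Both input orders select the same datum.
        System.out.println(pick(Arrays.asList(linked, fetched)).status);
        System.out.println(pick(Arrays.asList(fetched, linked)).status);
    }
}
```

With an ordering like this the linked datum could never shadow the 
fetch_success one, whatever order the values come in.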

Thanks!
                
> SegmentMerger missing proper crawl_fetch datum
> ----------------------------------------------
>
>                 Key: NUTCH-1616
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1616
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.8
>
>
> Merged 26036 vs. unmerged 26038 indexed documents! There are two records on 
> the merged segment that no longer have a crawl_fetch CrawlDatum with a 
> fetch_success status. Instead, the only crawl_fetch CrawlDatum has status 
> linked!
> The original segment has two crawl_fetch CrawlDatums, one with the linked 
> and one with the fetch_success status.
> Without the fetch_success or not_modified status it is not going to be 
> indexed.

