[ 
https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601314#comment-14601314
 ] 

Sebastian Nagel commented on NUTCH-1625:
----------------------------------------

Were the segments indexed within a single indexing job? There is still
NUTCH-1416, which prevents safely indexing multiple segments with
overlapping content because no ordering of items (CrawlDatums, ParseData, etc.)
with the same key/URL is guaranteed. Some classes we cannot keep in order without
wrapping them with the segment name (as done by SegmentMerger); for others we
could re-establish the order, see NUTCH-1617. I'll re-open NUTCH-1416: it should
not be too hard to fix, and it's really annoying that there is no simple way to
re-index a bunch of segments. If we get it fixed, your patch is a necessary
part.
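
To illustrate the ordering problem, here is a minimal sketch of the "wrap with the segment name" idea. It is not Nutch code: the `Tagged` class and the `newest` method are hypothetical, and it assumes segment names are timestamp-style strings that sort chronologically. Once each value carries its segment name, the newest one can be picked regardless of the order in which the values arrive.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not Nutch code: values for one URL arrive from several
// segments in arbitrary order, so each value carries its segment name.
final class SegmentOrderSketch {

    /** A value tagged with the segment it came from (hypothetical type). */
    static final class Tagged {
        final String segment; // assumed timestamp-style name, e.g. "20150625120000"
        final String payload; // stands in for a CrawlDatum, ParseData, etc.

        Tagged(String segment, String payload) {
            this.segment = segment;
            this.payload = payload;
        }
    }

    /** Returns the value from the newest segment, whatever the input order. */
    static Tagged newest(List<Tagged> values) {
        return Collections.max(values, Comparator.comparing((Tagged t) -> t.segment));
    }

    public static void main(String[] args) {
        List<Tagged> values = Arrays.asList(
            new Tagged("20150625120000", "new content"),
            new Tagged("20150601120000", "old content"));
        System.out.println(newest(values).payload);
    }
}
```

Without the segment tag there is nothing to compare on, which is why untagged values with the same key cannot be put back in order after the fact.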

> IndexerMapReduce skips FETCH_NOTMODIFIED
> ----------------------------------------
>
>                 Key: NUTCH-1625
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1625
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.7
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Critical
>             Fix For: 1.11
>
>         Attachments: NUTCH-1625.patch, NUTCH-1625.patch
>
>
> IndexerMapReduce has an option to skip DB_NOTMODIFIED, but legacy code also
> skips FETCH_NOTMODIFIED, and the latter is not optional. We can keep the check,
> but it should also include FETCH_NOTMODIFIED. Relying on FETCH_NOTMODIFIED
> isn't very useful anyway because since 1.5 or so we can safely rely on
> DB_NOTMODIFIED, as it is properly set in the CrawlDbReducer.
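
One possible reading of the description above, sketched in Java: a single option governs both not-modified checks, so FETCH_NOTMODIFIED is no longer skipped unconditionally. This is not the IndexerMapReduce source or the attached patch. The `Status` enum, the `shouldSkip` method and the `skipNotModified` flag are hypothetical stand-ins.

```java
// Hypothetical sketch, not the IndexerMapReduce source.
final class SkipNotModifiedSketch {

    /** Stand-in for the relevant CrawlDatum statuses (hypothetical enum). */
    enum Status { DB_FETCHED, DB_NOTMODIFIED, FETCH_SUCCESS, FETCH_NOTMODIFIED }

    /**
     * Skips a document only when the option is enabled and either the
     * CrawlDb status or the fetch status reports "not modified".
     */
    static boolean shouldSkip(Status dbStatus, Status fetchStatus, boolean skipNotModified) {
        if (!skipNotModified) {
            return false; // option off: index unmodified documents as well
        }
        return dbStatus == Status.DB_NOTMODIFIED
            || fetchStatus == Status.FETCH_NOTMODIFIED;
    }

    public static void main(String[] args) {
        // Same statuses, first with the option off, then with it on.
        System.out.println(shouldSkip(Status.DB_FETCHED, Status.FETCH_NOTMODIFIED, false));
        System.out.println(shouldSkip(Status.DB_FETCHED, Status.FETCH_NOTMODIFIED, true));
    }
}
```

With the option disabled, nothing is skipped on either status, which is the behaviour the legacy unconditional FETCH_NOTMODIFIED check prevented.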



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
