[ https://issues.apache.org/jira/browse/NUTCH-1625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14601314#comment-14601314 ]
Sebastian Nagel commented on NUTCH-1625:
----------------------------------------
Were the segments indexed within a single indexing job? There is still
NUTCH-1416, which makes it unsafe to index multiple segments with
overlapping content because no ordering of items (CrawlDatums, ParseData, etc.)
with the same key/URL is guaranteed. Some classes we cannot keep in order without
wrapping them with the segment name (as done by SegmentMerger); for others we
could re-establish the order, see NUTCH-1617. I'll re-open NUTCH-1416: it should
not be too hard to fix, and it's really annoying that there is no simple way to
re-index a bunch of segments. If we get it fixed, your patch is a necessary
part of the solution.
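
To illustrate the ordering problem, here is a minimal sketch of the wrapping idea, not the actual SegmentMerger or IndexerMapReduce code. The class `SegmentOrderSketch`, the holder `Tagged`, and the method `latest` are hypothetical names. It assumes segment names are timestamps such as 20150625101500, so that comparing them as strings orders segments by age.

```java
import java.util.ArrayList;
import java.util.List;

class SegmentOrderSketch {

    /** Hypothetical stand-in for a value (e.g. ParseData) wrapped with its segment name. */
    static final class Tagged {
        final String segment; // assumed to be a timestamp-style segment name
        final String payload; // placeholder for the real Writable

        Tagged(String segment, String payload) {
            this.segment = segment;
            this.payload = payload;
        }
    }

    /**
     * Picks the payload from the most recent segment. The result does not
     * depend on the order in which the values for one key/URL arrive.
     */
    static String latest(List<Tagged> values) {
        Tagged best = null;
        for (Tagged v : values) {
            if (best == null || v.segment.compareTo(best.segment) > 0) {
                best = v;
            }
        }
        return best == null ? null : best.payload;
    }

    public static void main(String[] args) {
        List<Tagged> values = new ArrayList<>();
        // Values for the same URL, arriving in arbitrary order.
        values.add(new Tagged("20150625101500", "newer parse"));
        values.add(new Tagged("20150620090000", "older parse"));
        System.out.println(latest(values));
    }
}
```

Without the segment name attached to each value, a reducer would have no way to tell which of the two items is the newer one.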
> IndexerMapReduce skips FETCH_NOTMODIFIED
> ----------------------------------------
>
> Key: NUTCH-1625
> URL: https://issues.apache.org/jira/browse/NUTCH-1625
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.7
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Critical
> Fix For: 1.11
>
> Attachments: NUTCH-1625.patch, NUTCH-1625.patch
>
>
> IndexerMapReduce has the option to skip DB_NOTMODIFIED, but legacy code also
> skips FETCH_NOTMODIFIED, and the latter is not optional. We can keep the check,
> but it should also cover FETCH_NOTMODIFIED. Relying on FETCH_NOTMODIFIED
> isn't very useful anyway because since 1.5 or so we can safely rely on
> DB_NOTMODIFIED, as it is properly set in the CrawlDBReducer.
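
One possible reading of the description above, as a minimal sketch rather than the attached patch: a single optional check that covers both statuses. The class `NotModifiedSkipSketch`, the enum `Status`, and the flag `skipNotModified` are hypothetical stand-ins for the CrawlDatum status constants and the indexer's skip option.

```java
class NotModifiedSkipSketch {

    /** Hypothetical stand-ins for the relevant CrawlDatum statuses. */
    enum Status { DB_NOTMODIFIED, FETCH_NOTMODIFIED, FETCH_SUCCESS }

    /**
     * Returns true if a document with this status should be left out of the
     * index. Both "not modified" statuses are skipped only when the option is
     * enabled, so neither is skipped unconditionally.
     */
    static boolean shouldSkip(Status status, boolean skipNotModified) {
        return skipNotModified
            && (status == Status.DB_NOTMODIFIED || status == Status.FETCH_NOTMODIFIED);
    }

    public static void main(String[] args) {
        System.out.println(shouldSkip(Status.FETCH_NOTMODIFIED, true));
        System.out.println(shouldSkip(Status.FETCH_NOTMODIFIED, false));
    }
}
```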