[
https://issues.apache.org/jira/browse/NUTCH-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-2183:
----------------------------------------
Attachment: NUTCH-2183.patch
Patch for trunk.
> Improvement to SegmentChecker for skipping non-segments present in segments
> directory
> -------------------------------------------------------------------------------------
>
> Key: NUTCH-2183
> URL: https://issues.apache.org/jira/browse/NUTCH-2183
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, segment
> Affects Versions: 1.11
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.12
>
> Attachments: NUTCH-2183.patch
>
>
> The scenario is that you have a bunch of Nutch data which has been gathered
> over some period of time. Some of the data structures are present, some are
> not. In the segments directory, for example, there are .zip files (don't ask
> why) and in other directories there are .tar.gz files, etc.
> This patch improves the SegmentChecker to skip directories or files present
> within the segments directory whose names are not 14 characters in length, as
> ALL segment names are. It also applies this check to individual segments when
> used by the IndexingJob. This means that we can avoid the Indexer blowing up
> if it is run on one segment (e.g. without the -dir option) and encounters
> some arbitrary directory present within segments/ which turns out not to be
> a segment after all.
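
For illustration only, the name-length rule described above could be sketched roughly as follows. This is not the code in NUTCH-2183.patch: the class and method names (SegmentNameFilter, hasSegmentNameLength, isAllDigits, keepCandidates) are hypothetical, and the all-digits test is an extra assumption of this sketch, based on segment names being timestamps, which the description does not mention.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: filters directory entry names
// using the "segment names are 14 characters long" rule from the issue.
final class SegmentNameFilter {

    // Length quoted in the issue description.
    static final int SEGMENT_NAME_LENGTH = 14;

    // The check described in the issue: the name must be exactly 14 characters.
    static boolean hasSegmentNameLength(String name) {
        return name != null && name.length() == SEGMENT_NAME_LENGTH;
    }

    // Assumption of this sketch only: a stricter test that also requires
    // every character to be an ASCII digit.
    static boolean isAllDigits(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
        }
        return true;
    }

    // Keeps only the names that pass both checks, preserving input order.
    static List<String> keepCandidates(List<String> names) {
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            if (hasSegmentNameLength(name) && isAllDigits(name)) {
                kept.add(name);
            }
        }
        return kept;
    }
}
```

Note that a name such as archive.tar.gz happens to be exactly 14 characters long, so a length-only check would let it through. That is why this sketch adds the digit test; whether the real SegmentChecker needs something similar is a separate question.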