Lewis John McGibbney created NUTCH-2183:
-------------------------------------------
Summary: Improvement to SegmentChecker for skipping non-segments
present in segments directory
Key: NUTCH-2183
URL: https://issues.apache.org/jira/browse/NUTCH-2183
Project: Nutch
Issue Type: Improvement
Components: indexer, segment
Affects Versions: 1.11
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Fix For: 1.12
The scenario is that you have a bunch of Nutch data which has been gathered
over some period of time. Some of the data structures are present, some are
not. In the segments directory, for example, there are .zip files (don't ask
why), and in other directories there are .tar.gz files, etc.
This patch improves the SegmentChecker to skip directories or files present
within the segments directory whose names are not 14 characters in length, as
ALL segment names are. It also applies this check to individual segments when
they are passed to the IndexingJob. This means that we can avoid the Indexer
blowing up if it is run on one segment (e.g. without the -dir option) and
encounters some arbitrary directory present within segments/ which turns out
not to be a segment after all.
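As a rough illustration of the check described above (not the patch itself),
here is a minimal standalone sketch. It uses plain java.nio rather than the
Hadoop FileSystem API that Nutch works with, and the class and method names
(SegmentNameFilter, looksLikeSegment, candidateSegments) are hypothetical,
chosen only for this example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper; not part of Nutch. Illustrates the name-length check only.
public class SegmentNameFilter {

    // Segment names are 14 characters long, per the issue description.
    static final int SEGMENT_NAME_LENGTH = 14;

    // True if the name has the length expected of a segment name.
    static boolean looksLikeSegment(String name) {
        return name != null && name.length() == SEGMENT_NAME_LENGTH;
    }

    // Lists entries of segmentsDir whose names pass the length check,
    // skipping everything else (e.g. stray .zip or .tar.gz files).
    static List<Path> candidateSegments(Path segmentsDir) throws IOException {
        List<Path> result = new ArrayList<>();
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(segmentsDir)) {
            for (Path entry : entries) {
                if (looksLikeSegment(entry.getFileName().toString())) {
                    result.add(entry);
                }
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: SegmentNameFilter <segments-dir>");
            return;
        }
        for (Path segment : candidateSegments(Paths.get(args[0]))) {
            System.out.println(segment);
        }
    }
}
```

Note that a length check alone is a cheap heuristic: any stray entry whose
name happens to be 14 characters long would still get through, so a caller
may still want to verify the expected segment subdirectories afterwards.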
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)