Lewis John McGibbney created NUTCH-2183:
-------------------------------------------

             Summary: Improvement to SegmentChecker for skipping non-segments present in segments directory
                 Key: NUTCH-2183
                 URL: https://issues.apache.org/jira/browse/NUTCH-2183
             Project: Nutch
          Issue Type: Improvement
          Components: indexer, segment
    Affects Versions: 1.11
            Reporter: Lewis John McGibbney
            Assignee: Lewis John McGibbney
             Fix For: 1.12


The scenario is that you have a bunch of Nutch data which has been gathered 
over some period of time. Some of the data structures are present, some are 
not. In the segments directory, for example, there are .zip files (don't ask 
why) and in other directories there are .tar.gz files, etc.
This patch improves the SegmentChecker to skip directories or files present 
within the segments directory whose names are not 14 characters in length, as 
ALL segment names are. It also applies this check to individual segments when 
used by the IndexingJob. This means that we can avoid the Indexer blowing up 
if it is run on one segment (e.g. without the -dir option) and detects some 
arbitrary directory present within segments/ which actually turns out not to 
be a segment after all.
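
For illustration only, the name-length check described above could look 
roughly like the sketch below. This is not the code from the attached patch: 
the class and method names (SegmentNameFilter, looksLikeSegment, keepSegments) 
are hypothetical, and it works on plain strings rather than Hadoop Path 
objects so that it stays self-contained. The sample name 20160101123045 
assumes the usual timestamp-style segment naming.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: illustrates skipping entries in a
// segments directory whose names are not 14 characters long.
class SegmentNameFilter {

  // Assumption taken from the issue text: all segment names are 14 characters.
  static final int SEGMENT_NAME_LENGTH = 14;

  // Returns true only if the entry name has the expected segment-name length.
  static boolean looksLikeSegment(String name) {
    return name != null && name.length() == SEGMENT_NAME_LENGTH;
  }

  // Keeps the entries that pass the length check and drops everything else
  // (e.g. stray .zip or .tar.gz files sitting next to real segments).
  static List<String> keepSegments(List<String> names) {
    List<String> kept = new ArrayList<>();
    for (String name : names) {
      if (looksLikeSegment(name)) {
        kept.add(name);
      }
    }
    return kept;
  }
}
```

Note that a length check alone is only a cheap first filter: a stray entry 
whose name happens to be 14 characters long would still pass it, so any 
further validation of the segment contents remains necessary.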



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
