[ 
https://issues.apache.org/jira/browse/NUTCH-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2183:
----------------------------------------
    Attachment: NUTCH-2183.patch

Patch for trunk.

> Improvement to SegmentChecker for skipping non-segments present in segments 
> directory
> -------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2183
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2183
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, segment
>    Affects Versions: 1.11
>            Reporter: Lewis John McGibbney
>            Assignee: Lewis John McGibbney
>             Fix For: 1.12
>
>         Attachments: NUTCH-2183.patch
>
>
> The scenario is that you have a bunch of Nutch data which has been gathered 
> over some period of time. Some of the data structures are present, some are 
> not. In the segments directory, for example, there are .zip files (don't ask 
> why) and in other directories there are .tar.gz files, etc.
> This patch improves the SegmentChecker to skip directories or files present 
> within the segments directory whose names are not 14 characters in length, 
> as ALL segment names are. It also applies this check to individual segments 
> when used by the IndexingJob. This means that we can stop the Indexer from 
> blowing up if it is run on one segment (e.g. without the -dir option) and 
> encounters some arbitrary directory within segments/ which actually turns 
> out not to be a segment after all.
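
For readers without the attachment to hand, the name-length check described
above can be sketched roughly as follows. This is not the code from
NUTCH-2183.patch: the class SegmentNameFilter and its methods are hypothetical
names used only for illustration, and the sketch works on the local filesystem
via java.nio rather than through whatever filesystem API the real
SegmentChecker uses.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch: illustrates the 14-character rule only.
class SegmentNameFilter {

  // Length of a segment directory name, as stated in the issue description.
  static final int SEGMENT_NAME_LENGTH = 14;

  // True if the name has the length every segment name is expected to have.
  static boolean looksLikeSegment(String name) {
    return name != null && name.length() == SEGMENT_NAME_LENGTH;
  }

  // Lists sub-directories of segmentsDir whose names pass the length check,
  // silently skipping stray entries such as .zip or .tar.gz files.
  static List<Path> listCandidateSegments(Path segmentsDir) throws IOException {
    List<Path> candidates = new ArrayList<>();
    try (DirectoryStream<Path> entries = Files.newDirectoryStream(segmentsDir)) {
      for (Path entry : entries) {
        String name = entry.getFileName().toString();
        if (Files.isDirectory(entry) && looksLikeSegment(name)) {
          candidates.add(entry);
        }
      }
    }
    return candidates;
  }
}
```

A length check alone is only a cheap first filter: a stray directory whose
name happens to be 14 characters long would still pass it, so any further
validation of the segment's contents is still needed afterwards.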



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
