The following page has been changed by RobPettengill:
http://wiki.apache.org/nutch/bin/nutch_segslice

New page:
segslice is an alias for net.nutch.segment.!SegmentSlicer

This class reads data from one or more input segments, and outputs it to one or 
more output segments, optionally deleting the input segments when it's finished.

Data is read sequentially from the input segments and appended to the current 
output segment until that segment reaches the target count of entries, at which 
point the next output segment is created, and so on.
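
The slicing behaviour described above can be sketched as follows. This is only 
an illustration in Python, not the actual SegmentSlicer code: the function name 
slice_entries is hypothetical, segments are modelled as plain lists of entries, 
and it is assumed that omitting the target count produces a single output 
segment and that the count, when given, is positive.

```python
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def slice_entries(
    segments: Iterable[Iterable[T]],
    max_count: Optional[int] = None,
) -> Iterator[List[T]]:
    """Yield output slices built by reading input segments in order.

    Hypothetical sketch: each input segment is an iterable of entries,
    and each yielded list stands for one output segment.
    """
    current: List[T] = []
    for segment in segments:          # input segments are read sequentially
        for entry in segment:
            current.append(entry)     # append to the current output segment
            # Assumption: max_count is positive when it is given.
            if max_count is not None and len(current) >= max_count:
                yield current         # target count reached, close this slice
                current = []          # and start the next output segment
    if current:
        yield current                 # final, possibly shorter, slice
```

Note that a slice can span the boundary between two input segments, and that 
duplicate entries are passed through unchanged.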

NOTE 1: this tool does NOT de-duplicate data - use SegmentMergeTool for that.

NOTE 2: this tool does NOT copy indexes. It is currently impossible to slice 
Lucene indexes. The proper procedure is first to create slices, and then to 
index them.

NOTE 3: if one or more input segments are in non-parsed format, the output 
segments will also use non-parsed format. This means that any parseData and 
parseText data from input segments will NOT be copied to the output segments.
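
The rule in NOTE 3 can be illustrated with a small Python sketch. The function 
name parts_to_copy and its parameters are hypothetical, and the way the rule is 
combined with the -nocontent, -noparsedata and -noparsetext options is an 
assumption made for illustration, not a description of the real implementation.

```python
from typing import Iterable, Set


def parts_to_copy(
    inputs_parsed: Iterable[bool],
    no_content: bool = False,
    no_parse_data: bool = False,
    no_parse_text: bool = False,
) -> Set[str]:
    """Return the names of the segment parts that would be copied.

    Hypothetical sketch: inputs_parsed holds one flag per input segment,
    True when that segment is in parsed format.
    """
    # A single non-parsed input forces non-parsed output (NOTE 3).
    all_parsed = all(inputs_parsed)
    parts: Set[str] = set()
    if not no_content:
        parts.add("content")
    if all_parsed and not no_parse_data:
        parts.add("parse_data")
    if all_parsed and not no_parse_text:
        parts.add("parse_text")
    return parts
```

For example, mixing one parsed and one non-parsed input segment yields output 
without parse_data and parse_text, even though one input contained them.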

Usage: bin/nutch net.nutch.segment.!SegmentSlicer (-local | -ndfs 
<namenode:port>) -o outputDir [-max count] [-fix] [-nocontent] [-noparsedata] 
[-noparsetext] (-dir segments | seg1 seg2 ...)[[BR]]
NOTE: at least one segment directory name, or the '-dir' option, is required.
outputDir is always required.[[BR]]
-o outputDir[[BR]]
  output directory for segments[[BR]]
-max count[[BR]]
  (optional) output multiple segments, each with maximum 'count' entries[[BR]]
-fix[[BR]]
  (optional) automatically fix corrupted segments[[BR]]
-nocontent[[BR]]
  (optional) ignore content data[[BR]]
-noparsedata[[BR]]
  (optional) ignore parse_data data[[BR]]
-noparsetext[[BR]]
  (optional) ignore parse_text data[[BR]]
-dir segments[[BR]]
  directory containing multiple segments[[BR]]
seg1 seg2 ...[[BR]]
  segment directories[[BR]]

[CommandLineOptions]
