Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by RobPettengill:
http://wiki.apache.org/nutch/bin/nutch_mergesegs

New page:
mergesegs is an alias for net.nutch.tools.!SegmentMergeTool

This class cleans up accumulated segment data and merges it into a single 
segment (or, optionally, multiple segments) that contains no duplicates.

The only prerequisite for its correct operation is a set of already fetched 
segments (they don't have to contain parsed content; only fetcher output is 
required). This tool does not use DeleteDuplicates; instead it creates its own 
"master" index of all pages in all segments. It then walks sequentially through 
this index and picks only the most recent version of each page for every unique 
value of URL or hash.
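
The selection step described above can be sketched roughly as follows. This is 
only an illustration of the idea, not the code of SegmentMergeTool: the Entry 
class, its fields and the keepMostRecent method are hypothetical names, and the 
rule "skip an entry if either its URL or its hash was already kept" is one 
possible reading of "unique value of url or hash".

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class DedupSketch {
    /** Hypothetical stand-in for one row of the "master" index; not a Nutch class. */
    static final class Entry {
        final String url;
        final String hash;
        final long fetchTime;
        final String segment;

        Entry(String url, String hash, long fetchTime, String segment) {
            this.url = url;
            this.hash = hash;
            this.fetchTime = fetchTime;
            this.segment = segment;
        }
    }

    /**
     * Walks the entries from newest to oldest and keeps an entry only if
     * neither its URL nor its hash belongs to an entry that was already kept.
     */
    static List<Entry> keepMostRecent(List<Entry> all) {
        List<Entry> sorted = new ArrayList<>(all);
        sorted.sort(Comparator.comparingLong((Entry e) -> e.fetchTime).reversed());

        Set<String> seenUrls = new HashSet<>();
        Set<String> seenHashes = new HashSet<>();
        List<Entry> kept = new ArrayList<>();
        for (Entry e : sorted) {
            if (seenUrls.contains(e.url) || seenHashes.contains(e.hash)) {
                continue; // an equal-or-newer version of this page was already kept
            }
            seenUrls.add(e.url);
            seenHashes.add(e.hash);
            kept.add(e);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Entry> all = new ArrayList<>();
        all.add(new Entry("http://example.com/a", "h1", 100L, "seg1"));
        all.add(new Entry("http://example.com/a", "h2", 200L, "seg2"));
        for (Entry e : keepMostRecent(all)) {
            System.out.println(e.url + " from " + e.segment);
        }
    }
}
```

In this sketch an older copy of a page loses to a newer one regardless of which 
input segment it came from, which matches the "most recent version" wording 
above.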

If some of the input segments are corrupted, this tool will attempt to repair 
them using the 
net.nutch.segment.!SegmentReader.fixSegment(!NutchFileSystem, File, boolean, 
boolean, boolean, boolean) method.

The output segment can optionally be split on the fly into several segments of 
fixed length.
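
As a rough illustration of splitting "on the fly", the sketch below starts a 
new output list whenever the current one has reached the maximum number of 
entries. The names SplitSketch and splitOnTheFly are hypothetical, and plain 
in-memory lists stand in for segments; how the real tool writes its output is 
not shown here.

```java
import java.util.ArrayList;
import java.util.List;

class SplitSketch {
    /**
     * Distributes entries, in order, into consecutive chunks holding at most
     * 'max' entries each. Only the last chunk may be shorter.
     */
    static <T> List<List<T>> splitOnTheFly(Iterable<T> entries, int max) {
        if (max <= 0) {
            throw new IllegalArgumentException("max must be positive");
        }
        List<List<T>> segments = new ArrayList<>();
        List<T> current = null;
        for (T entry : entries) {
            // Roll over to a fresh chunk once the current one is full.
            if (current == null || current.size() == max) {
                current = new ArrayList<>();
                segments.add(current);
            }
            current.add(entry);
        }
        return segments;
    }

    public static void main(String[] args) {
        List<String> pages = new ArrayList<>();
        for (int i = 1; i <= 5; i++) {
            pages.add("page" + i);
        }
        for (List<String> segment : splitOnTheFly(pages, 2)) {
            System.out.println(segment.size() + " entries: " + segment);
        }
    }
}
```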

The newly created segment(s) can then optionally be indexed, so that they can 
either be merged with more new segments or used for searching as they are.

Old segments may optionally be removed, because all needed data has already 
been copied to the new merged segment. NOTE: this tool will also remove all 
corrupted input segments, which are not usable anyway. However, this option 
may be dangerous if you inadvertently included non-segment directories as 
input.
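
One way to guard against that mistake is to look over the input directory 
before enabling deletion. The sketch below is a hypothetical helper, not part 
of Nutch: it lists every child of a directory that does not contain a given 
marker subdirectory. The marker name "fetcher" used in the demo is an 
assumption about the on-disk segment layout; check what your own segments 
contain before relying on it.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

class SegmentDirCheckSketch {
    /**
     * Returns the children of inputDir that do NOT contain a subdirectory
     * named markerSubdir, i.e. entries that do not look like segments.
     */
    static List<Path> suspiciousInputs(Path inputDir, String markerSubdir) throws IOException {
        List<Path> suspicious = new ArrayList<>();
        try (Stream<Path> children = Files.list(inputDir)) {
            children.sorted().forEach(child -> {
                if (!Files.isDirectory(child.resolve(markerSubdir))) {
                    suspicious.add(child);
                }
            });
        }
        return suspicious;
    }

    /** Builds a throwaway directory tree and returns the names that get flagged. */
    static List<String> demo() {
        try {
            Path root = Files.createTempDirectory("segcheck");
            // "fetcher" is an assumed marker name, used here for illustration only.
            Files.createDirectories(root.resolve("20050101000000").resolve("fetcher"));
            Files.createDirectories(root.resolve("backup"));
            List<String> names = new ArrayList<>();
            for (Path p : suspiciousInputs(root, "fetcher")) {
                names.add(p.getFileName().toString());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Not segment-like: " + demo());
    }
}
```

Anything this check reports is worth inspecting by hand before the original 
inputs are deleted. Note that a corrupted segment may also be reported if it 
lacks the marker subdirectory.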

Instead of following the manual procedures, you may want to run 
SegmentMergeTool with all options turned on, i.e. to merge segments into the 
output segment(s), index them, and then delete the original segment data.

Usage: bin/nutch net.nutch.tools.!SegmentMergeTool (-local | -nfs ...)
         (-dir <input_segments_dir> | seg1 seg2 ...)
         [-o <output_segments_dir>] [-max count] [-i] [-ds]

  -dir <input_segments_dir>
      path to directory containing input segments
  seg1 seg2 seg3
      individual paths to input segments
  -o <output_segment_dir>
      (optional) path to directory which will contain output segment(s).
      NOTE: If not present, the original segments path will be used.
  -max count
      (optional) output multiple segments, each with maximum 'count' entries
  -i
      (optional) index the output segment when finished merging
  -ds
      (optional) delete the original input segments when finished
