Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by JerryRussell:
http://wiki.apache.org/nutch/bin/nutch_mergesegs

The comment on the change is:
fixed classpath to org.apache

------------------------------------------------------------------------------
- mergesegs is an alias for net.nutch.tools.!SegmentMergeTool
+ mergesegs is an alias for org.apache.nutch.tools.!SegmentMergeTool
  This class cleans up accumulated segments data, and merges them into a single (or optionally multiple) segment(s), with no duplicates in it.
  There are no prerequisites for its correct operation except for a set of already fetched segments (they don't have to contain parsed content, only fetcher output is required).
  This tool does not use DeleteDuplicates, but creates its own "master" index of all pages in all segments. Then it walks sequentially through this index and picks up only most recent versions of pages for every unique value of url or hash.
- If some of the input segments are corrupted, this tool will attempt to repair them, using net.nutch.segment.!SegmentReader.fixSegment(!NutchFileSystem, File, boolean, boolean, boolean, boolean) method.
+ If some of the input segments are corrupted, this tool will attempt to repair them, using org.apache.nutch.segment.!SegmentReader.fixSegment(!NutchFileSystem, File, boolean, boolean, boolean, boolean) method.
  Output segment can be optionally split on the fly into several segments of fixed length.
@@ -16, +16 @@
  You may want to run SegmentMergeTool instead of following the manual procedures, with all options turned on, i.e. to merge segments into the output segment(s), index it, and then delete the original segments data.
- Usage: bin/nutch net.nutch.tools.!SegmentMergeTool (-local | -nfs ...) [[BR]]
+ Usage: bin/nutch org.apache.nutch.tools.!SegmentMergeTool (-local | -nfs ...) [[BR]]
  (-dir <input_segments_dir> | seg1 seg2 ...) [[BR]]
  [-o <output_segments_dir>] [-max count] [-i] [-ds] [[BR]]
  -dir <input_segments_dir> "path to directory containing input segments" [[BR]]
