[Nutch Wiki] Update of "bin/nutch_mergesegs" by LewisJohnMcgibbney

Apache Wiki Fri, 01 Jul 2011 23:57:07 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "bin/nutch_mergesegs" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_mergesegs?action=diff&rev1=5&rev2=6

Comment:
Update to reflect Nutch 1.3 API

- mergesegs is an alias for org.apache.nutch.tools.!SegmentMergeTool
+ Mergesegs is an alias for org.apache.nutch.segment.SegmentMerger
  
- This class cleans up accumulated segments data, and merges them into a single 
(or optionally multiple) segment(s), with no duplicates in it.
+ This tool takes several segments and merges their data together. Only the 
latest version of the data is retained. Optionally, you can apply the current 
URLFilters to remove prohibited URLs. It is also possible to slice the 
resulting segment into chunks of fixed size.
  
- There are no prerequisites for its correct operation except for a set of 
already fetched segments (they don't have to contain parsed content, only 
fetcher output is required). This tool does not use DeleteDuplicates, but 
creates its own "master" index of all pages in all segments. Then it walks 
sequentially through this index and picks up only most recent versions of pages 
for every unique value of url or hash.
+ ==Important Notes==
  
- If some of the input segments are corrupted, this tool will attempt to repair 
them, using 
org.apache.nutch.segment.!SegmentReader.fixSegment(!NutchFileSystem, File, 
boolean, boolean, boolean, boolean) method.
+ ===Which parts are merged?===
+ It doesn't make sense to merge data from segments that are at different 
stages of processing (e.g. one unfetched segment, one fetched but not parsed, 
and one fetched and parsed). Therefore, prior to merging, the tool determines 
the lowest common set of input data, and only this data is merged. This may 
have unintended consequences: e.g. if the majority of input segments are 
fetched and parsed, but one of them is unfetched, the tool will fall back to 
merging only the fetchlists, and it will skip all other data from all 
segments.
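
As a rough illustration of the "lowest common set" rule, the sketch below 
lists the segment parts that are present in every input segment. It is not 
the tool's own code: `common_parts` is a hypothetical helper, the segment 
names are made up, and the check is reduced to testing whether the usual 
segment subdirectories exist.

```shell
#!/bin/sh
# Hypothetical helper: print the segment parts present in EVERY given segment.
# Illustration only; SegmentMerger performs its own check internally.
common_parts() {
    for part in crawl_generate crawl_fetch crawl_parse content parse_data parse_text; do
        present=yes
        for seg in "$@"; do
            # A part missing from any one segment rules it out for all.
            [ -d "$seg/$part" ] || { present=no; break; }
        done
        [ "$present" = yes ] && echo "$part"
    done
    return 0
}

# Throwaway demo layout: two fetched-and-parsed segments plus one segment
# that only contains a fetchlist (crawl_generate).
demo=$(mktemp -d)
for part in crawl_generate crawl_fetch crawl_parse content parse_data parse_text; do
    mkdir -p "$demo/20110701000000/$part" "$demo/20110702000000/$part"
done
mkdir -p "$demo/20110703000000/crawl_generate"

common=$(common_parts "$demo"/*)
echo "parts that would be merged: $common"
rm -rf "$demo"
```

With the unfetched segment included, only `crawl_generate` survives, which 
mirrors the fallback to merging fetchlists described above.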
  
- Output segment can be optionally split on the fly into several segments of 
fixed length.
+ ===Merging fetchlists===
+ Merging segments that contain only fetchlists (i.e. prior to fetching) is 
not recommended, because this tool (unlike org.apache.nutch.crawl.Generator) 
doesn't ensure that the fetchlist parts for each map task are disjoint.
  
- The newly created segment(s) can be then optionally indexed, so that it can 
be either merged with more new segments, or used for searching as it is.
+ ===Duplicate content===
+ Merging segments removes older content whenever possible (see below). 
However, this is NOT the same as de-duplication, which additionally removes 
identical content found at different URLs. In other words, running a command 
to delete duplicates is still necessary.
  
- Old segments may be optionally removed, because all needed data has already 
been copied to the new merged segment. NOTE: this tool will remove also all 
corrupted input segments, which are not useable anyway - however, this option 
may be dangerous if you inadvertently included non-segment directories as 
input...
+ For some types of data (especially ParseText) it's not possible to determine 
which version is really older. Therefore the tool always uses segment names as 
timestamps, for all types of input data. Segment names are compared in forward 
lexicographic order (0-9a-zA-Z), and data from segments with "higher" names 
will prevail. It is therefore extremely important that segment names increase 
lexicographically as their creation time increases.
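
The naming rule can be checked with an ordinary sort. The sketch below uses 
made-up segment names and restricts itself to digits, where byte order and 
the order quoted above agree; it assumes timestamp-style names of equal 
length (yyyyMMddHHmmss), and shows why ad hoc names such as seg9 and seg10 
are risky.

```shell
#!/bin/sh
# Hypothetical timestamp-style segment names (yyyyMMddHHmmss), all the same
# length, so lexicographic order matches chronological order.
newest=$(printf '%s\n' 20110630120000 20110701235707 20110515080000 \
    | LC_ALL=C sort | tail -n 1)
echo "data from segment $newest would prevail"

# Counter-example with made-up names: "seg10" was created after "seg9",
# but it sorts BEFORE it, so the older segment would win.
winner=$(printf '%s\n' seg9 seg10 | LC_ALL=C sort | tail -n 1)
echo "with ad hoc names, $winner would prevail"
```

Zero-padding counters (seg09, seg10) or using fixed-width timestamps avoids 
the second case.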
  
- You may want to run SegmentMergeTool instead of following the manual 
procedures, with all options turned on, i.e. to merge segments into the output 
segment(s), index it, and then delete the original segments data.
+ ===Merging and indexes===
+ The merged segment gets a different name. Since Indexer embeds segment names 
in indexes, any indexes originally created for the input segments will NOT 
work with the merged segment. Newly created merged segments need to be indexed 
afresh. This tool doesn't use existing indexes in any way, so if you plan to 
merge segments you don't have to index them prior to merging.
  
- Usage: bin/nutch org.apache.nutch.tools.!SegmentMergeTool (-local | -nfs ...) 
<<BR>>
- (-dir <input_segments_dir> | seg1 seg2 ...) <<BR>>
- [-o <output_segments_dir>] [-max count] [-i] [-ds] <<BR>>
- -dir <input_segments_dir> "path to directory containing input segments" <<BR>>
- seg1 seg2 seg3 "... individual paths to input segments" <<BR>>
- -o <output_segment_dir> "(optional) path to directory which will contain 
output segment(s).
- NOTE: If not present, the original segments path will be used." <<BR>>
- -max count "(optional) output multiple segments, each with maximum 'count' 
entries"<<BR>>
- -i "(optional) index the output segment when finished merging" <<BR>>
- -ds "(optional) delete the original input segments when finished"
+ The only prerequisite for merging is a set of already fetched segments 
(they don't have to contain parsed content; only fetcher output is 
required).
+ 
+ Usage:
+ {{{
+ bin/nutch mergesegs <output_dir> (-dir segments | seg1 seg2 ...) [-filter] 
[-slice NNNN]
+ }}}
+ 
+ '''<output_dir>''': The path of the parent directory for the output segment 
slice(s).
+ 
+ '''-dir segments''': The path to the parent directory containing several 
segments.
+ 
+ '''seg1 seg2 ...''': An explicit list of the individual segment directories 
to be merged.
+ 
+ '''[-filter]''': Filters out URLs using the currently configured 
URLFilters. This can be used to improve the quality of the resulting segments 
after a merge is executed.
+ 
+ '''[-slice NNNN]''': Pass this argument to create several output segments, 
each containing NNNN URLs, e.g. -slice 40 to merge 10 segments of 20 URLs each 
into 5 segments of 40 URLs each.
+ 
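
The arithmetic behind the -slice example can be sketched as follows. This is 
a back-of-the-envelope estimate only: it assumes that no URLs are dropped 
while merging (no older versions superseded, no -filter) and that slices are 
filled completely; the paths in the echoed command are placeholders.

```shell
#!/bin/sh
# Numbers from the -slice example: 10 input segments of 20 URLs each.
segments=10
urls_per_segment=20
slice=40

total=$((segments * urls_per_segment))
# Ceiling division: a final, partially filled slice still needs a segment.
slices=$(( (total + slice - 1) / slice ))
echo "$total URLs -> $slices output segments of up to $slice URLs"

# The matching invocation, with placeholder paths, would look like this:
echo "bin/nutch mergesegs crawl/merged -dir crawl/segments -slice $slice"
```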
  
  CommandLineOptions
