Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "bin/nutch_readseg" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_readseg

Comment:
Update to reflect Nutch 1.3 API

New page:
Readseg is an alias for org.apache.nutch.segment.SegmentReader

This class is similar to readdb in that it dumps the contents of a segment. There are three ways we can use this class:

{{{
1st Usage: bin/nutch readseg -dump <segment_dir> <output> [general options]
}}}
'''-dump''': Dumps the content of <segment_dir> as a text file to <output>.

'''[general options]''': General options are provided below.

{{{
2nd Usage: bin/nutch readseg -list (<segment_dir1> ... | -dir <segments>) [general options]
}}}
'''-list''': This argument prints a synopsis of the segments in the specified directories, or of all segments in the directory <segments>, to System.out.

'''<segment_dir1> ...''': This should be a list of the paths of the individual segment directories to process.

'''-dir <segments>''': This should be the path to a directory that contains multiple segments.

'''[general options]''': General options are provided below.

{{{
3rd Usage: bin/nutch readseg -get <segment_dir> <keyValue> [general options]
}}}
'''-get''': This argument gets a specified record from a segment and prints it to System.out.

'''<segment_dir>''': Path to the segment directory.

'''<keyValue>''': This should be the value of the key (URL) we wish to retrieve specific information about. N.B. It is essential to put "double-quotes" around strings with spaces.

'''[general options]''': The general options are:

 * '''-nocontent''': Pass this to ignore the content directory.
 * '''-nofetch''': To ignore the crawl_fetch directory.
 * '''-nogenerate''': To ignore the crawl_generate directory.
 * '''-noparse''': To ignore the crawl_parse directory.
 * '''-noparsedata''': To ignore the parse_data directory.
 * '''-noparsetext''': To ignore the parse_text directory.

CommandLineOptions
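
The three usages above can be sketched as a short shell session. This is only an illustration under stated assumptions: the segment name 20110718114601, the crawl/segments layout, the output directory segment_dump and the URL key are all hypothetical, and the readseg wrapper function is not part of Nutch. The wrapper prints each command and runs it only if the launcher is actually present, so the sketch can be tried as a dry run.

```shell
#!/bin/sh
# Hypothetical locations: adjust these to your own installation and crawl.
NUTCH="${NUTCH:-bin/nutch}"
SEGMENTS_DIR="${SEGMENTS_DIR:-crawl/segments}"
SEGMENT="${SEGMENT:-$SEGMENTS_DIR/20110718114601}"

# Illustrative wrapper (not part of Nutch): print the command line, then
# execute it only when the nutch launcher exists and is executable.
readseg() {
    echo "$NUTCH readseg $*"
    if [ -x "$NUTCH" ]; then
        "$NUTCH" readseg "$@"
    fi
}

# 1st usage: dump a segment as text to an output directory,
# ignoring the content and parse_text directories.
readseg -dump "$SEGMENT" segment_dump -nocontent -noparsetext

# 2nd usage: print a synopsis of every segment under one directory.
readseg -list -dir "$SEGMENTS_DIR"

# 3rd usage: print the record stored for one URL key. The key is
# double-quoted, which matters as soon as it contains spaces.
readseg -get "$SEGMENT" "http://nutch.apache.org/" -nofetch -nogenerate
```

Run from the Nutch installation directory, the wrapper executes each command for real; anywhere else it only echoes them, which is a convenient way to check the argument order before touching a large segment.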

