The "bin/nutch_parse" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_parse?action=diff&rev1=5&rev2=6

Comment:
Update to reflect Nutch 1.3 API

- parse is an alias for org.apache.nutch.tools.!ParseSegment
+ Parse is an alias for org.apache.nutch.parse.ParseSegment
+ The class parses contents in one segment. It assumes, under the given segment, the existence of ./fetcher_output/, which is typically generated after a non-parsing fetcher run (i.e., fetcher is started with option -noParsing or as default 'false' boolean value as specified in nutch-default.xml).
- Parse contents in one segment.
-
- It assumes, under given segment, existence of ./fetcher_output/, which is typically generated after a non-parsing fetcher run (i.e., fetcher is started with option -noParsing).
  Contents in one segment are parsed and saved in these steps:
- 1. ./fetcher_output/ and ./content/ are looped together (possibly by multiple ParserThreads), and content is parsed for each entry. The entry number and resultant ParserOutput are saved in ./parser.unsorted.
+ '''1.''' ./fetcher_output/ and ./content/ are looped together (possibly by multiple ParserThreads), and content is parsed for each entry. The entry number and resultant ParserOutput are saved in ./parser.unsorted.
- 2. ./parser.unsorted is sorted by entry number, result saved as ./parser.sorted.
+ '''2.''' ./parser.unsorted is sorted by entry number, result saved as ./parser.sorted.
- 3. ./parser.sorted and ./fetcher_output/ are looped together. At each entry, ParserOutput is split into ParseDate and ParseText, which are saved in ./parse_data/ and ./parse_text/ respectively. Also updated is FetcherOutput with parsing status, which is saved in ./fetcher/.
+ '''3.''' ./parser.sorted and ./fetcher_output/ are looped together. At each entry, ParserOutput is split into ParseDate and ParseText, which are saved in ./parse_data/ and ./parse_text/ respectively. Also updated is FetcherOutput with parsing status, which is saved in ./fetcher/.
- In the end, ./fetcher/ should be identical to one resulted from fetcher run WITHOUT option -noParsing.
+
+ In the end, ./fetcher/ should be identical to a directory produced as a result from the fetcher being run WITHOUT option -noParsing e.g. fetching and parsing in the same command. N.B. This is not suggested in a production environment.
  By default, intermediates ./parser.unsorted and ./parser.sorted are removed at the end, unless option -noClean is used. However ./fetcher_output/ is kept intact.
- Check Fetcher.java and FetcherOutput.java for further discussion.
+ Check Fetcher.java and FetcherOutput.java for further details.
- Usage: bin/nutch org.apache.nutch.tools.!ParseSegment (-local | -ndfs <namenode:port>) [-threads n] [-showThreadID] [-dryRun] [-logLevel level] [-noClean] dir
+ }}}
+ Usage: bin/nutch parse <segmentdir>
+ {{{
+
+ '''<segmentdir>''': This should be the path to the segment directory containing our data for parsing.
+ CommandLineOptions
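
The three steps described in the page text (parse each entry into an unsorted intermediate, sort by entry number, then split each result into parse data and parse text) can be illustrated with a small sketch. This is not Nutch code: it is a minimal Python illustration under stated assumptions, and every name in it (ParserOutput, parse_entry, parse_segment, the "first line is the title" toy parser) is hypothetical and chosen only to mirror the description.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ParserOutput:
    """Hypothetical stand-in for the per-entry parse result."""
    data: Dict[str, str]  # metadata about the entry (illustrative only)
    text: str             # extracted plain text


def parse_entry(content: str) -> ParserOutput:
    """Toy parser (an assumption): first line is the title, the rest is text."""
    title, _, body = content.partition("\n")
    return ParserOutput(data={"title": title.strip()}, text=body.strip())


def parse_segment(contents: Dict[int, str]) -> Tuple[List[Dict[str, str]], List[str]]:
    # Step 1: parse every entry and record (entry number, result).
    # The order here is whatever order the entries arrive in, which
    # plays the role of the unsorted intermediate.
    unsorted_results = [(num, parse_entry(raw)) for num, raw in contents.items()]

    # Step 2: sort the intermediate by entry number.
    sorted_results = sorted(unsorted_results, key=lambda pair: pair[0])

    # Step 3: split each result into its data part and its text part,
    # kept in entry-number order so the two lists stay aligned.
    parse_data = [out.data for _, out in sorted_results]
    parse_text = [out.text for _, out in sorted_results]
    return parse_data, parse_text
```

Sorting before the split is what keeps the two outputs aligned entry by entry, even if the parsing step finishes entries out of order.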

