Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by RobPettengill:
http://wiki.apache.org/nutch/bin/nutch_parse

New page:
parse is an alias for net.nutch.tools.!ParseSegment

Parse contents in one segment.

It assumes that the given segment contains ./fetcher_output/, which is 
typically generated by a non-parsing fetcher run (i.e., the fetcher is started 
with option -noParsing).

The contents of one segment are parsed and saved in these steps:

1. ./fetcher_output/ and ./content/ are iterated over together (possibly by 
multiple ParserThreads), and the content of each entry is parsed. The entry 
number and the resulting ParserOutput are saved in ./parser.unsorted.

2. ./parser.unsorted is sorted by entry number, and the result is saved as 
./parser.sorted.

3. ./parser.sorted and ./fetcher_output/ are iterated over together. At each 
entry, ParserOutput is split into ParseData and ParseText, which are saved in 
./parse_data/ and ./parse_text/ respectively. FetcherOutput is also updated 
with the parsing status and saved in ./fetcher/.
In the end, ./fetcher/ should be identical to the one produced by a fetcher run 
WITHOUT option -noParsing.

By default, the intermediate files ./parser.unsorted and ./parser.sorted are 
removed at the end, unless option -noClean is used. However, ./fetcher_output/ 
is kept intact.
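
The three steps and the cleanup rule above can be sketched roughly as follows. 
This is only an illustration in Python, not the actual ParseSegment code: the 
segment is modelled as a plain dict, and parse_content, parse_segment and the 
"parsed" status field are hypothetical names invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parse_content(content):
    # Hypothetical stand-in for a real parser: returns (data, text).
    return {"length": len(content)}, content.strip().lower()


def parse_segment(fetcher_output, contents, threads=2, no_clean=False):
    """Sketch of the parse pipeline over in-memory lists (assumed model)."""
    segment = {}

    # Step 1: parse every entry, possibly in parallel. Results are collected
    # in completion order, so they may be out of order by entry number.
    def work(entry_no):
        return entry_no, parse_content(contents[entry_no])

    unsorted_results = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        futures = [pool.submit(work, i) for i in range(len(contents))]
        for future in as_completed(futures):
            unsorted_results.append(future.result())
    segment["parser.unsorted"] = unsorted_results

    # Step 2: sort by entry number.
    segment["parser.sorted"] = sorted(unsorted_results, key=lambda r: r[0])

    # Step 3: walk the sorted results alongside the fetcher output, split each
    # result into data and text, and record the parsing status.
    parse_data, parse_text, fetcher = [], [], []
    for fo, (_, (data, text)) in zip(fetcher_output, segment["parser.sorted"]):
        parse_data.append(data)
        parse_text.append(text)
        fetcher.append({**fo, "parsed": True})  # input entry is not modified
    segment["parse_data"] = parse_data
    segment["parse_text"] = parse_text
    segment["fetcher"] = fetcher

    # Cleanup: drop the intermediates unless no_clean is set.
    if not no_clean:
        del segment["parser.unsorted"]
        del segment["parser.sorted"]
    return segment
```

Sorting before the final pass is what lets step 3 pair each parse result with 
its fetcher entry by position, regardless of which thread finished first.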

Check Fetcher.java and FetcherOutput.java for further discussion.

Usage: bin/nutch net.nutch.tools.!ParseSegment (-local | -ndfs <namenode:port>) 
[-threads n] [-showThreadID] [-dryRun] [-logLevel level] [-noClean] dir
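
As a rough illustration of the usage line above, the following Python helper 
assembles an argument list for the parse alias. The helper name 
build_parse_command and the segment path in the example are hypothetical; only 
the flags shown in the usage line are used, and only a subset of them.

```python
def build_parse_command(segment_dir, ndfs=None, threads=None, no_clean=False):
    # "parse" is the alias for the ParseSegment tool described above.
    cmd = ["bin/nutch", "parse"]
    # Exactly one of -local or -ndfs <namenode:port> is required.
    cmd += ["-ndfs", ndfs] if ndfs else ["-local"]
    if threads is not None:
        cmd += ["-threads", str(threads)]
    if no_clean:
        cmd.append("-noClean")  # keep the intermediate files
    cmd.append(segment_dir)     # the segment directory comes last
    return cmd


# Example with a placeholder segment directory:
print(" ".join(build_parse_command("segments/example", threads=4)))
```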

[CommandLineOptions]
