[Nutch Wiki] Update of "bin/nutch parse" by kiranchitturi

Apache Wiki Wed, 20 Mar 2013 11:09:34 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "bin/nutch parse" page has been changed by kiranchitturi:
http://wiki.apache.org/nutch/bin/nutch%20parse

New page:
Parse is an alias for org.apache.nutch.parse.ParseSegment.

The class parses the contents of one segment. It assumes that the given segment 
contains ./fetcher_output/, which is typically generated by a non-parsing 
fetcher run (i.e., the fetcher was started with option -noParsing, or parsing 
during fetch was left at its default boolean value of 'false' as specified in 
nutch-default.xml).

Contents in one segment are parsed and saved in these steps:

'''1.''' ./fetcher_output/ and ./content/ are looped together (possibly by 
multiple ParserThreads), and content is parsed for each entry. The entry number 
and resultant ParserOutput are saved in ./parser.unsorted.

'''2.''' ./parser.unsorted is sorted by entry number, result saved as 
./parser.sorted.

'''3.''' ./parser.sorted and ./fetcher_output/ are looped together. At each 
entry, ParserOutput is split into ParseData and ParseText, which are saved in 
./parse_data/ and ./parse_text/ respectively. FetcherOutput is also updated 
with the parsing status and saved in ./fetcher/.
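
The three steps above can be sketched in memory as follows. This is only an illustration under stated assumptions, not the ParseSegment implementation: the `ParserOutput` dataclass, `toy_parse` and `parse_segment` are hypothetical names, plain Python lists stand in for the on-disk directories, and the toy parser simply treats the first line of each entry as its metadata.

```python
from dataclasses import dataclass


@dataclass
class ParserOutput:
    """Hypothetical stand-in for the parser result of one entry."""
    data: str  # metadata part, destined for ./parse_data/
    text: str  # plain-text part, destined for ./parse_text/


def toy_parse(raw: str) -> ParserOutput:
    """Toy parser (assumption): first line is metadata, the rest is text."""
    head, _, body = raw.partition("\n")
    return ParserOutput(data=head, text=body)


def parse_segment(fetcher_output, content):
    """Mimic the parse, sort and split steps on in-memory lists."""
    # Step 1: walk fetcher_output and content together and parse each entry,
    # keeping the entry number. The reversal imitates parser threads that
    # finish out of order (the role of ./parser.unsorted).
    paired = list(enumerate(zip(fetcher_output, content)))
    unsorted = [(entry_no, toy_parse(raw)) for entry_no, (_, raw) in reversed(paired)]

    # Step 2: sort by entry number (the role of ./parser.sorted).
    sorted_results = sorted(unsorted, key=lambda pair: pair[0])

    # Step 3: walk the sorted results and fetcher_output together, split each
    # ParserOutput, and record the parsing status on a copy of the fetch record.
    parse_data, parse_text, fetcher = [], [], []
    for (_, output), fetched in zip(sorted_results, fetcher_output):
        parse_data.append(output.data)
        parse_text.append(output.text)
        fetcher.append({**fetched, "parsed": True})
    return parse_data, parse_text, fetcher
```

Sorting in step 2 is what lets step 3 walk the parse results and the fetch records in lockstep without any lookup by key.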

In the end, ./fetcher/ should be identical to the directory produced by 
running the fetcher WITHOUT option -noParsing, i.e. fetching and parsing in 
the same command. N.B. Fetching and parsing in the same command is not 
recommended in a production environment.

By default, the intermediate files ./parser.unsorted and ./parser.sorted are 
removed at the end, unless option -noClean is used. However, ./fetcher_output/ 
is kept intact.
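
As a rough aid for checking a segment before and after a run, the directory names mentioned on this page can be tested for with a few lines of Python. The function name `segment_report` and the shape of its result are assumptions made for this sketch; it only checks that the names exist and does not inspect their contents.

```python
from pathlib import Path

# Names taken from the description above.
INPUTS = ("fetcher_output", "content")
OUTPUTS = ("parse_data", "parse_text", "fetcher")
INTERMEDIATES = ("parser.unsorted", "parser.sorted")


def segment_report(segment_dir):
    """Report which of the expected entries exist under a segment directory."""
    seg = Path(segment_dir)

    def present(names):
        return [name for name in names if (seg / name).exists()]

    return {
        # Both inputs must exist before parsing can start.
        "ready_to_parse": len(present(INPUTS)) == len(INPUTS),
        # All three outputs exist once parsing has finished.
        "parsed": len(present(OUTPUTS)) == len(OUTPUTS),
        # Non-empty after a run only if the intermediates were kept.
        "leftover_intermediates": present(INTERMEDIATES),
    }
```

After a default run, `leftover_intermediates` would be expected to be empty while `fetcher_output` is still present.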

Check Fetcher.java and FetcherOutput.java for further details.

{{{
Usage: bin/nutch parse <segmentdir>
}}}

'''<segmentdir>''': The path to the segment directory containing the data to 
be parsed.


CommandLineOptions
