[Nutch Wiki] Update of "bin/nutch_parse" by LewisJohnMcgibbney

Apache Wiki Fri, 01 Jul 2011 15:45:59 -0700

The "bin/nutch_parse" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_parse?action=diff&rev1=5&rev2=6

Comment:
Update to reflect Nutch 1.3 API

- parse is an alias for org.apache.nutch.tools.!ParseSegment
+ Parse is an alias for org.apache.nutch.parse.ParseSegment
  
+ The class parses contents in one segment. It assumes that ./fetcher_output/ 
exists under the given segment; this directory is typically generated by a 
non-parsing fetcher run (i.e., the fetcher is started with option -noParsing, 
or fetcher parsing is left at its default boolean value of 'false' as specified 
in nutch-default.xml).
- Parse contents in one segment.
- 
- It assumes, under given segment, existence of ./fetcher_output/, which is 
typically generated after a non-parsing fetcher run (i.e., fetcher is started 
with option -noParsing).
  
  Contents in one segment are parsed and saved in these steps:
  
- 1. ./fetcher_output/ and ./content/ are looped together (possibly by multiple 
ParserThreads), and content is parsed for each entry. The entry number and 
resultant ParserOutput are saved in ./parser.unsorted.
+ '''1.''' ./fetcher_output/ and ./content/ are looped together (possibly by 
multiple ParserThreads), and content is parsed for each entry. The entry number 
and resultant ParserOutput are saved in ./parser.unsorted.
  
- 2. ./parser.unsorted is sorted by entry number, result saved as 
./parser.sorted.
+ '''2.''' ./parser.unsorted is sorted by entry number, result saved as 
./parser.sorted.
  
- 3. ./parser.sorted and ./fetcher_output/ are looped together. At each entry, 
ParserOutput is split into ParseData and ParseText, which are saved in 
./parse_data/ and ./parse_text/ respectively. Also updated is FetcherOutput 
with parsing status, which is saved in ./fetcher/.
+ '''3.''' ./parser.sorted and ./fetcher_output/ are looped together. At each 
entry, ParserOutput is split into ParseData and ParseText, which are saved in 
./parse_data/ and ./parse_text/ respectively. Also updated is FetcherOutput 
with parsing status, which is saved in ./fetcher/.
- In the end, ./fetcher/ should be identical to one resulted from fetcher run 
WITHOUT option -noParsing.
+ 
+ In the end, ./fetcher/ should be identical to the directory produced when the 
fetcher is run WITHOUT option -noParsing, i.e. when fetching and parsing happen 
in the same command. N.B. Fetching and parsing in a single command is not 
recommended in a production environment.
  
  By default, the intermediate files ./parser.unsorted and ./parser.sorted are 
removed at the end, unless option -noClean is used. However, ./fetcher_output/ 
is kept intact.
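
The three steps and the clean-up described above can be sketched roughly as follows. This is a minimal in-memory illustration in Python, not Nutch's Java implementation: the function parse_segment, the toy_parse helper, the dictionary keys and the "parsed" flag are all hypothetical stand-ins for the segment subdirectories and record types named in the text.

```python
from typing import Any, Callable, Dict, List


def toy_parse(raw: str) -> Dict[str, Any]:
    """Hypothetical parser: returns a ParserOutput-like dict."""
    return {"data": {"length": len(raw)}, "text": raw.strip()}


def parse_segment(
    fetcher_output: List[Dict[str, Any]],
    content: List[str],
    parse: Callable[[str], Dict[str, Any]],
    no_clean: bool = False,
) -> Dict[str, Any]:
    """Model a segment as a dict whose keys mimic the subdirectory names."""
    segment: Dict[str, Any] = {
        "fetcher_output": list(fetcher_output),
        "content": list(content),
    }

    # Step 1: walk fetcher_output and content together and parse each entry.
    # Several parser threads may finish out of order; reversed() imitates that.
    pairs = list(enumerate(zip(segment["fetcher_output"], segment["content"])))
    segment["parser.unsorted"] = [
        (entry_no, parse(raw)) for entry_no, (_fo, raw) in reversed(pairs)
    ]

    # Step 2: sort by entry number.
    segment["parser.sorted"] = sorted(
        segment["parser.unsorted"], key=lambda entry: entry[0]
    )

    # Step 3: walk the sorted output and fetcher_output together, split each
    # parser output into its data and text parts, and record parsing status.
    segment["parse_data"], segment["parse_text"], segment["fetcher"] = [], [], []
    for (_entry_no, output), fo in zip(
        segment["parser.sorted"], segment["fetcher_output"]
    ):
        segment["parse_data"].append(output["data"])
        segment["parse_text"].append(output["text"])
        segment["fetcher"].append({**fo, "parsed": True})

    # Intermediates are dropped unless the caller asks to keep them;
    # fetcher_output is left untouched either way.
    if not no_clean:
        del segment["parser.unsorted"], segment["parser.sorted"]
    return segment
```

Calling parse_segment(..., no_clean=True) keeps both intermediates in the returned dict, which makes the effect of the sort in step 2 easy to inspect.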
  
- Check Fetcher.java and FetcherOutput.java for further discussion.
+ Check Fetcher.java and FetcherOutput.java for further details.
  
- Usage: bin/nutch org.apache.nutch.tools.!ParseSegment (-local | -ndfs 
<namenode:port>) [-threads n] [-showThreadID] [-dryRun] [-logLevel level] 
[-noClean] dir
+ {{{
+ Usage: bin/nutch parse <segmentdir>
+ }}}
+ 
+ '''<segmentdir>''': The path to the segment directory containing the data to 
be parsed.
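
For scripted crawls, the invocation can be assembled programmatically. The sketch below is only an illustration: the helper name build_parse_command, the nutch_home parameter and the example segment path are assumptions, and the function only builds the argument list without running Nutch.

```python
from pathlib import PurePosixPath
from typing import List


def build_parse_command(segment_dir: str, nutch_home: str = ".") -> List[str]:
    """Return the argv for `bin/nutch parse <segmentdir>` without running it.

    nutch_home is a hypothetical parameter: the directory that contains
    the bin/nutch script.
    """
    if not segment_dir:
        raise ValueError("segment_dir must not be empty")
    script = PurePosixPath(nutch_home) / "bin" / "nutch"
    return [str(script), "parse", segment_dir]
```

The resulting list can be handed to subprocess.run(cmd, check=True), for example with a segment path such as crawl/segments/20110701154559 (a made-up name).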
+ 
  
  CommandLineOptions
