Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by JoeyMazzarelli:
http://wiki.apache.org/nutch/NutchTutorial

The comment on the change is:
current path to DmozParser

------------------------------------------------------------------------------
  Next we select a random subset of these pages. (We use a random subset so 
that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ 
contains around three million URLs. We select one out of every 5000, so that we 
end up with around 1000 URLs:
  
  {{{ mkdir dmoz
- bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls }}}
+ bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls }}}
  
  The parser also takes a few minutes, as it must parse the full file. Finally, 
we initialize the crawl db with the selected urls.
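
The "one out of every N" selection described above can be illustrated with a minimal Python sketch. This is only an illustration of the idea, not DmozParser's actual implementation; the function name `sample_one_in_n` and its `seed` parameter are hypothetical. Each URL is kept independently with probability 1/N, so for a list of M URLs the expected subset size is M/N.

```python
import random


def sample_one_in_n(urls, n, seed=None):
    """Keep each URL independently with probability 1/n.

    Hypothetical helper for illustration only; it does not reproduce
    how DmozParser chooses its subset.
    """
    rng = random.Random(seed)
    # randrange(n) returns an integer in [0, n), so it equals 0
    # with probability 1/n.
    return [url for url in urls if rng.randrange(n) == 0]


if __name__ == "__main__":
    # Placeholder URLs; a real run would read them from the parsed dump.
    urls = ["http://example.com/page%d" % i for i in range(100000)]
    subset = sample_one_in_n(urls, 100)
    print(len(subset), "of", len(urls), "URLs selected")
```

Because the selection is random, two people running the same command would, under this scheme, generally end up with different URL lists, which is the point of using a random subset.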
  
