Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The following page has been changed by JoeyMazzarelli:
http://wiki.apache.org/nutch/NutchTutorial

The comment on the change is:
current path to DmozParser

------------------------------------------------------------------------------
  Next we select a random subset of these pages. (We use a random subset so that everyone who runs this tutorial doesn't hammer the same sites.) DMOZ contains around three million URLs. We select one out of every 5000, so that we end up with around 1000 URLs:

  {{{
  mkdir dmoz
- bin/nutch org.apache.nutch.crawl.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls }}}
+ bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls }}}

  The parser also takes a few minutes, as it must parse the full file.

  Finally, we initialize the crawl db with the selected urls.
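
The "one out of every 5000" selection described in the tutorial text can be illustrated with a minimal Python sketch. It is not DmozParser's implementation, whose internals are not shown here. The function name `select_subset`, its parameters, and the example URLs are hypothetical, and the sketch assumes each URL is kept independently with probability 1/5000.

```python
import random


def select_subset(urls, denom, seed=None):
    """Keep each URL with probability 1/denom.

    Hypothetical helper for illustration only; this is an assumption about
    how a random subset could be drawn, not a description of DmozParser.
    """
    rng = random.Random(seed)
    # randrange(denom) returns 0 with probability exactly 1/denom
    return [u for u in urls if rng.randrange(denom) == 0]


if __name__ == "__main__":
    # Made-up stand-in for the roughly three million DMOZ URLs
    urls = (f"http://example.com/page{i}" for i in range(3_000_000))
    subset = select_subset(urls, 5000)
    # Expected size is 3,000,000 / 5000 = 600, with some random variation
    print(len(subset))
```

Under these assumptions the expected subset size is 3,000,000 / 5000 = 600 URLs, the same order of magnitude as the tutorial's "around 1000". Because the choice is random, different users end up crawling different sites.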