The following page has been changed by EarlCahill:
http://wiki.apache.org/nutch/SimpleMapReduceTutorial

New page:
This is the simplest map reduce example I could come up with: local filesystem, just getting one segment indexed. I am running Ubuntu on an Athlon 3200+ with a cable modem connection.

== Designate Url ==
First, change to the right directory.
{{{
cd nutch/branches/mapred
}}}
We need a directory of files in which each line of each file is a URL. I chose http://lucene.apache.org/nutch/
{{{
mkdir urls
echo "http://lucene.apache.org/nutch/" > urls/urls
}}}
We also need to change the crawl filter to include this site.
{{{
perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt
}}}
We walk through the following steps: crawl, generate, fetch, updatedb, invertlinks, index.

== Crawl ==
We want to run crawl on the urls directory from above.
{{{
./bin/nutch crawl urls
}}}
This took me about ten minutes. The output included
{{{
051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s
}}}
The errors generally seemed to be timeouts.

The rest of the commands are a bit more dynamic, since they rely on timestamped directory names and the like. Environment variables help out.

== Generate ==
Here we generate a segment from the crawl above, using its crawldb and segments directories.
{{{
CRAWLDB=`find crawl-2* -name crawldb`
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
}}}
This took less than five seconds.

== Fetch ==
{{{
SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
./bin/nutch fetch $SEGMENT
}}}
This took about seven minutes, and the output looked like
{{{
051004 004931 65 pages, 404 errors, 0.2 pages/s, 19 kb/s,
}}}
Again, many timeouts.

== UpdateDB ==
{{{
./bin/nutch updatedb $CRAWLDB $SEGMENT
}}}
This took less than five seconds.

== InvertLinks ==
{{{
LINKDB=`find crawl-2* -maxdepth 1 -name linkdb`
SEGMENTS=`find crawl-2* -maxdepth 1 -name segments`
./bin/nutch invertlinks $LINKDB $SEGMENTS
}}}
This took less than five seconds.
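
The Fetch step above picks the newest segment with `find ... | tail -1`. The same idea can be sketched as a small shell function. It is only an illustration: `latest_segment` is a hypothetical helper, not part of Nutch, and it assumes that segment directories are named with a sortable timestamp beginning with `2`, as the `2*` glob in the tutorial suggests.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): print the newest entry under a
# segments directory. Assumes entries are named with a sortable timestamp
# that starts with "2", as the tutorial's 2* glob implies.
latest_segment() {
    segments_dir=$1
    newest=""
    # The shell expands globs in sorted order, so the last directory
    # visited is the one with the largest (most recent) timestamp.
    for candidate in "$segments_dir"/2*; do
        [ -d "$candidate" ] && newest=$candidate
    done
    # Fail if nothing matched, so callers can stop before running fetch.
    [ -n "$newest" ] || return 1
    printf '%s\n' "$newest"
}
```

With this defined, something like `SEGMENT=$(latest_segment "$SEGMENTS_DIR") && ./bin/nutch fetch "$SEGMENT"` would skip the fetch when no segment exists, rather than calling it with an empty argument.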

== Index ==
We need a place for our index, say myindex.
{{{
mkdir myindex
}}}
Now, let's index.
{{{
./bin/nutch index myindex $LINKDB $SEGMENT
}}}
This took less than ten seconds.

== Test ==
The best test I have for the moment is
{{{
ls -alR myindex
}}}
If you see several files, it at least did something.

Happy nutching!

Tutorial written by Earl Cahill, 2005.
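
The `ls -alR` check above can also be scripted. The following is a minimal sketch, assuming that "it at least did something" means the index directory holds at least one non-empty regular file. `index_has_files` is a hypothetical helper, not a Nutch command, and it says nothing about whether the index is actually valid.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): succeed only if the given
# directory exists and contains at least one non-empty regular file
# somewhere beneath it.
index_has_files() {
    index_dir=$1
    [ -d "$index_dir" ] || return 1
    # -size +0 skips zero-length files; grep -q . succeeds as soon as
    # find prints at least one path.
    find "$index_dir" -type f -size +0 | grep -q .
}
```

For example, `index_has_files myindex && echo "index looks populated"` prints the message only when the index step wrote something.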
