Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/SimpleMapReduceTutorial

The comment on the change is:
It is not a map reduce tutorial; it's only confusing people

------------------------------------------------------------------------------
- This is the simplest map reduce example I could come up with. Local filesystem, just getting one segment indexed. I am running Ubuntu, on an Athlon 3200+ using a cable modem connection.
+ deleted
- == Designate Url ==
- 
- Need to get to the right place:
- 
- {{{
- cd nutch/branches/mapred
- }}}
- 
- We need to make a directory that contains files, where each line of each file is a url. I chose http://lucene.apache.org/nutch/
- 
- {{{
- mkdir urls
- echo "http://lucene.apache.org/nutch/" > urls/urls
- }}}
- 
- We also need to change the crawl filter to include this site:
- 
- {{{
- perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt
- }}}
- 
- We walk through the following steps: crawl, generate, fetch, updatedb, invertlinks, index.
- 
- == Crawl ==
- 
- We want to run crawl on the urls directory from above.
- 
- {{{
- ./bin/nutch crawl urls
- }}}
- 
- Took me about ten minutes. Output included
- 
- 051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s
- 
- The errors generally seemed to be timeouts.
- 
- The rest of the commands are a bit more dynamic, relying on timestamps and the like. Environment variables help out.
- 
- == Generate ==
- 
- Here we generate a new segment dir under the crawl from above.
- 
- {{{
- CRAWLDB=`find crawl-2* -name crawldb`
- SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
- ./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
- }}}
- 
- Took less than five seconds.
- 
- == Fetch ==
- 
- {{{
- SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
- ./bin/nutch fetch $SEGMENT
- }}}
- 
- Took about seven minutes, and output looked like
- 
- 051004 004931 65 pages, 404 errors, 0.2 pages/s, 19 kb/s
- 
- Again, many timeouts.
- 
- == UpdateDB ==
- 
- {{{
- ./bin/nutch updatedb $CRAWLDB $SEGMENT
- }}}
- 
- Took less than five seconds.
- 
- == InvertLinks ==
- 
- {{{
- LINKDB=`find crawl-2* -maxdepth 1 -name linkdb`
- SEGMENTS=`find crawl-2* -maxdepth 1 -name segments`
- ./bin/nutch invertlinks $LINKDB $SEGMENTS
- }}}
- 
- Took less than five seconds.
- 
- == Index ==
- 
- We need a place for our index, say myindex:
- 
- {{{
- mkdir myindex
- }}}
- 
- Now, let's index.
- 
- {{{
- ./bin/nutch index myindex $LINKDB $SEGMENT
- }}}
- 
- Took less than ten seconds.
- 
- == Test ==
- 
- The best test I have for the moment is
- 
- {{{
- ls -alR myindex
- }}}
- 
- If you see several files, it at least did something. Happy nutching!
- 
- Tutorial written by Earl Cahill, 2005.
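For readers arriving via this notification: the `find`-based environment variables in the deleted tutorial can be exercised without running Nutch at all. The sketch below recreates the directory layout a `bin/nutch crawl urls` run would leave behind (the `crawl-20051004003916` and segment names are invented for illustration, not real crawl output) and shows what each variable resolves to.

```shell
#!/bin/sh
# Recreate, in a scratch dir, the layout the tutorial assumes after a
# crawl, so the find-based variable assignments can be tested offline.
# All directory names here are made-up examples.
set -e
cd "$(mktemp -d)"
mkdir -p crawl-20051004003916/crawldb
mkdir -p crawl-20051004003916/linkdb
mkdir -p crawl-20051004003916/segments/20051004004931

CRAWLDB=`find crawl-2* -name crawldb`
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
LINKDB=`find crawl-2* -maxdepth 1 -name linkdb`

echo "$CRAWLDB"       # crawl-20051004003916/crawldb
echo "$SEGMENTS_DIR"  # crawl-20051004003916/segments
echo "$SEGMENT"       # crawl-20051004003916/segments/20051004004931
echo "$LINKDB"        # crawl-20051004003916/linkdb
```

One small deviation from the deleted text: GNU find warns when `-maxdepth` appears after a test such as `-name`, so the sketch puts `-maxdepth 1` first; the matched paths are the same either way.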