Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by BartoszGadzimski:
http://wiki.apache.org/nutch/SimpleMapReduceTutorial

The comment on the change is:
It is not a map reduce tutorial; it's only confusing people

------------------------------------------------------------------------------
- This is the simplest map reduce example I could come up with. Local filesystem, just getting one segment indexed. I am running Ubuntu, on an Athlon 3200+ using a cable modem connection.
+ deleted
- == Designate Url ==
- 
- Need to get to the right place:
- 
- {{{
- cd nutch/branches/mapred
- }}}
- 
- We need to make a directory that contains files, where each line of each file is a url. I chose http://lucene.apache.org/nutch/
- 
- {{{
- mkdir urls
- echo "http://lucene.apache.org/nutch/" > urls/urls
- }}}
- 
- We also need to change the crawl filter to include this site:
- 
- {{{
- perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' conf/crawl-urlfilter.txt
- }}}
- 
- We walk through the following steps: crawl, generate, fetch, updatedb, invertlinks, index.
- 
- == Crawl ==
- 
- We want to run crawl on the urls directory from above.
- 
- {{{
- ./bin/nutch crawl urls
- }}}
- 
- Took me about ten minutes. Output included
- 
- 051004 003916 178 pages, 17 errors, 0.4 pages/s, 48 kb/s
- 
- The errors generally seemed to be timeouts.
- 
- The rest of the commands are a bit more dynamic, relying on timestamps and the like. Environment variables help out.
- 
- == Generate ==
- 
- Here we generate a new segment dir under the crawl from above.
- 
- {{{
- CRAWLDB=`find crawl-2* -name crawldb`
- SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
- ./bin/nutch generate $CRAWLDB $SEGMENTS_DIR
- }}}
- 
- Took less than five seconds.
- 
- == Fetch ==
- 
- {{{
- SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
- ./bin/nutch fetch $SEGMENT
- }}}
- 
- Took about seven minutes, and output looked like
- 
- 051004 004931 65 pages, 404 errors, 0.2 pages/s, 19 kb/s
- 
- Again, many timeouts.
- 
- == UpdateDB ==
- 
- {{{
- ./bin/nutch updatedb $CRAWLDB $SEGMENT
- }}}
- 
- Took less than five seconds.
- 
- == InvertLinks ==
- 
- {{{
- LINKDB=`find crawl-2* -maxdepth 1 -name linkdb`
- SEGMENTS=`find crawl-2* -maxdepth 1 -name segments`
- ./bin/nutch invertlinks $LINKDB $SEGMENTS
- }}}
- 
- Took less than five seconds.
- 
- == Index ==
- 
- We need a place for our index, say myindex:
- 
- {{{
- mkdir myindex
- }}}
- 
- Now, let's index.
- 
- {{{
- ./bin/nutch index myindex $LINKDB $SEGMENT
- }}}
- 
- Took less than ten seconds.
- 
- == Test ==
- 
- The best test I have for the moment is
- 
- {{{
- ls -alR myindex
- }}}
- 
- If you see several files, it at least did something. Happy nutching!
- 
- Tutorial written by Earl Cahill, 2005.
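For readers arriving via this notification: the `find`-based environment variables in the deleted tutorial can be exercised without running Nutch at all. The sketch below recreates the directory layout a `bin/nutch crawl urls` run would leave behind (the `crawl-20051004003916` and segment names are invented for illustration, not real crawl output) and shows what each variable resolves to.

```shell
#!/bin/sh
# Recreate, in a scratch dir, the layout the tutorial assumes after a
# crawl, so the find-based variable assignments can be tested offline.
# All directory names here are made-up examples.
set -e
cd "$(mktemp -d)"
mkdir -p crawl-20051004003916/crawldb
mkdir -p crawl-20051004003916/linkdb
mkdir -p crawl-20051004003916/segments/20051004004931

CRAWLDB=`find crawl-2* -name crawldb`
SEGMENTS_DIR=`find crawl-2* -maxdepth 1 -name segments`
SEGMENT=`find crawl-2*/segments/2* -maxdepth 0 | tail -1`
LINKDB=`find crawl-2* -maxdepth 1 -name linkdb`

echo "$CRAWLDB"       # crawl-20051004003916/crawldb
echo "$SEGMENTS_DIR"  # crawl-20051004003916/segments
echo "$SEGMENT"       # crawl-20051004003916/segments/20051004004931
echo "$LINKDB"        # crawl-20051004003916/linkdb
```

One small deviation from the deleted text: GNU find warns when `-maxdepth` appears after a test such as `-name`, so the sketch puts `-maxdepth 1` first; the matched paths are the same either way.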