Re: [Nutch-general] Nutch Step by Step - Maybe someone will find this useful?
Corrado,

Would it be possible for you to add this to the Wiki? Also, there are several other tutorials:

  http://lucene.apache.org/nutch/tutorial8.html
  http://wiki.apache.org/nutch/NutchTutorial
  http://wiki.apache.org/nutch/NutchHadoopTutorial

Maybe you can combine them?

Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message -----
From: zzcgiacomini [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wednesday, April 4, 2007 10:53:54 AM
Subject: [Nutch-general] Nutch Step by Step

Maybe someone will find this useful? I have spent some time playing with nutch-0 and collecting notes from the mailing lists; maybe someone will find these notes useful and could point out my mistakes. I am not at all a Nutch expert.

-Corrado

0) CREATE NUTCH USER AND GROUP

Create a nutch user and group, and perform all the following steps logged in as the nutch user. Put these lines in your .bash_profile:

  export JAVA_HOME=/opt/jdk
  export PATH=$JAVA_HOME/bin:$PATH

1) GET HADOOP AND NUTCH

Check out the Nutch and Hadoop trunks, as explained at http://lucene.apache.org/hadoop/version_control.html:

  svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk
  svn checkout http://svn.apache.org/repos/asf/lucene/hadoop/trunk

2) BUILD HADOOP

Build and produce the tar file:

  cd hadoop/trunk
  ant tar

To build Hadoop with 64-bit native libraries, proceed as follows:

A) Download and install the latest lzo library
   (http://www.oberhumer.com/opensource/lzo/download/).
   Note: the packages currently available for FC5 are too old.

  tar xvzf lzo-2.02.tar.gz
  cd lzo-2.02
  ./configure --prefix=/opt/lzo-2.02
  make install

B) Compile the native 64-bit libs for Hadoop, if needed:

  cd hadoop/trunk/src/native
  export LDFLAGS=-L/opt/jdk/jre/lib/amd64/server
  export JVM_DATA_MODEL=64
  CCFLAGS=-I/opt/lzo-2.02/include CPPFLAGS=-I/opt/lzo-2.02/include ./configure
  cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/
  cp src/org_apache_hadoop.h ./src/org/apache/hadoop/io/compress/lzo
  cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h
  cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h

In config.h replace the line

  #define HADOOP_LZO_LIBRARY libnotfound.so

with this one:

  #define HADOOP_LZO_LIBRARY liblzo2.so

then run:

  make

3) BUILD NUTCH

The nightly Nutch trunk now comes with hadoop-0.12.jar, but you may want to drop in the latest nightly-build Hadoop jar:

  mv nutch/trunk/lib/hadoop-0.12.jar nutch/trunk/lib/hadoop-0.12.jar.ori
  cp hadoop/trunk/build/hadoop-0.12.3-dev.jar nutch/trunk/lib/hadoop-0.12.jar
  cd nutch/trunk
  ant tar

4) INSTALL

Copy and untar the generated .tar.gz file on the machines that will take part in the engine's activities. In my case I only have two identical machines available, called myhost1 and myhost2. On each of them I have installed the Nutch binaries under /opt/nutch, while I have decided to keep the Hadoop distributed filesystem in a directory called hadoopFs, located on a large disk mounted on /disk10.

On both machines create the directory:

  mkdir /disk10/hadoopFs/

Copy the Hadoop 64-bit native libraries, if needed:

  mkdir /opt/nutch/lib/native/Linux-x86_64
  cp -fl hadoop/trunk/src/native/lib/.libs/* /opt/nutch/lib/native/Linux-x86_64

5) CONFIG

I will use myhost1 as the master machine running the namenode and jobtracker tasks; it will also run a datanode and tasktracker. myhost2 will only run a datanode and tasktracker.

A) On both machines change the conf/hadoop-site.xml configuration file. Here are the values I have used:

  fs.default.name    : myhost1.mydomain.org:9010
  mapred.job.tracker : myhost1.mydomain.org:9011
  mapred.map.tasks   : 40
  mapred.reduce.tasks: 3
  dfs.name.dir       : /opt/hadoopFs/name
  dfs.data.dir       : /opt/hadoopFs/data
  mapred.system.dir  : /opt/hadoopFs/mapreduce/system
  mapred.local.dir   : /opt/hadoopFs/mapreduce/local
  dfs.replication    : 2

The mapred.map.tasks property tells how many tasks you want to run in parallel. This should be a multiple of the number of computers that you have. In our case, since we are starting out with 2 computers, we will have 4 map and 4 reduce tasks. The dfs.replication property states how many servers a single file should be replicated to before it becomes available. Because we are using 2 servers I have set this at 2.
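The hadoop-site.xml values listed above can be sketched as an actual config fragment. The property names are standard Hadoop keys; the hostnames and paths are just the example values from this mail, and only the first few properties are written out, since the rest follow the same pattern:

```
<?xml version="1.0"?>
<!-- conf/hadoop-site.xml: per-site overrides, placed on every node -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>myhost1.mydomain.org:9010</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>myhost1.mydomain.org:9011</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- mapred.map.tasks, mapred.reduce.tasks, dfs.name.dir, dfs.data.dir,
       mapred.system.dir and mapred.local.dir follow the same pattern -->
</configuration>
```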
Re: Nutch Step by Step - Maybe someone will find this useful?
2007/4/5, Enis Soztutar [EMAIL PROTECTED]:
> Great work! Could you post these to the Nutch wiki as a step-by-step
> tutorial for newcomers?

Exactly what I wanted to say, on both points. :)

Cheers,
t.n.a.
Re: Nutch Step by Step - Maybe someone will find this useful?
Great work! Could you post these to the Nutch wiki as a step-by-step tutorial for newcomers?

zzcgiacomini wrote:

[...]

Because we are using 2 servers I have set this at 2.
Maybe you also want to change nutch-site.xml, adding http.redirect.max with a different value than the default of 3:

  http.redirect.max : 10

B) Be sure that your conf/slaves file contains the names of the slave machines. In my case:

  myhost1.mydomain.org
  myhost2.mydomain.org

C) Create directories for pid and log files on both machines:

  mkdir /opt/nutch/pids
  mkdir
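To make the daemons actually use the directories from step C, they can be wired up in conf/hadoop-env.sh via HADOOP_PID_DIR and HADOOP_LOG_DIR, the variables read by bin/hadoop-daemon.sh. This is only a sketch: the message above breaks off after the first mkdir, so the log-directory path here is an assumption.

```shell
# conf/hadoop-env.sh additions (sketch): point the daemon start/stop scripts
# at the directories created in step C.
# /opt/nutch/logs is an assumed path, not taken from the original mail.
export HADOOP_PID_DIR=/opt/nutch/pids
export HADOOP_LOG_DIR=/opt/nutch/logs
```

With these set, bin/start-all.sh on the master writes each daemon's pid and log files under /opt/nutch instead of the defaults.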