Re: [Nutch-general] Nutch Step by Step Maybe someone will find this useful ?

2007-04-05 Thread ogjunk-nutch
Corrado,

Would it be possible for you to add this to the Wiki?

Also, there are several other tutorials:
  http://lucene.apache.org/nutch/tutorial8.html
  http://wiki.apache.org/nutch/NutchTutorial
  http://wiki.apache.org/nutch/NutchHadoopTutorial

Maybe you can combine them?

Otis
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share

- Original Message 
From: zzcgiacomini [EMAIL PROTECTED]
To: nutch-user@lucene.apache.org
Sent: Wednesday, April 4, 2007 10:53:54 AM
Subject: [Nutch-general] Nutch Step by Step Maybe someone will find this useful ?

I have spent some time playing with nutch-0 and collecting notes from the
mailing lists ... maybe someone will find these notes useful and could point
out my mistakes.
I am not at all a Nutch expert...
-Corrado

 




Re: Nutch Step by Step Maybe someone will find this useful ?

2007-04-05 Thread Tomi N/A

2007/4/5, Enis Soztutar [EMAIL PROTECTED]:

Great work, could you just post these into the Nutch wiki as a step-by-step
tutorial for newcomers?


Exactly what I wanted to say, both points. :)

Cheers,
t.n.a.


Re: Nutch Step by Step Maybe someone will find this useful ?

2007-04-05 Thread Enis Soztutar
Great work, could you just post these into the Nutch wiki as a step-by-step
tutorial for newcomers?


zzcgiacomini wrote:
I have spent some time playing with nutch-0 and collecting notes from
the mailing lists ... maybe someone will find these notes useful and could
point out my mistakes.

I am not at all a Nutch expert...
-Corrado






 0) CREATE NUTCH USER AND GROUP

Create a nutch user and group (a sketch of one way to do this is shown after the
exports below) and perform all of the following steps logged in as the nutch user.
Put these lines in your .bash_profile:

export JAVA_HOME=/opt/jdk
export PATH=$JAVA_HOME/bin:$PATH
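
If the account does not exist yet, it can be created roughly like this (a minimal
sketch using the standard shadow-utils commands; run as root, and note that options
and defaults vary by distribution):

# create the nutch group and a nutch user that belongs to it, with a home directory
groupadd nutch
useradd -g nutch -m nutch
# switch to the new account for the remaining steps
su - nutch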

 1) GET HADOOP and NUTCH

Download the Nutch and Hadoop trunks as explained at
http://lucene.apache.org/hadoop/version_control.html:

(svn checkout http://svn.apache.org/repos/asf/lucene/nutch/trunk)
(svn checkout http://svn.apache.org/repos/asf/lucene/hadoop/trunk)

 2) BUILD HADOOP

For example, build and produce the tar file:

cd hadoop/trunk
ant tar

To build Hadoop with 64-bit native libraries, proceed as follows:

A) Download and install the latest LZO library
(http://www.oberhumer.com/opensource/lzo/download/).
Note: the packages currently available for FC5 are too old.


tar xvzf lzo-2.02.tar.gz
cd lzo-2.02
./configure --prefix=/opt/lzo-2.02
make install

B) Compile the native 64-bit libraries for Hadoop, if needed:

cd hadoop/trunk/src/native

export LDFLAGS=-L/opt/jdk/jre/lib/amd64/server
export JVM_DATA_MODEL=64

CCFLAGS=-I/opt/lzo-2.02/include CPPFLAGS=-I/opt/lzo-2.02/include ./configure

cp src/org_apache_hadoop.h src/org/apache/hadoop/io/compress/zlib/
cp src/org_apache_hadoop.h ./src/org/apache/hadoop/io/compress/lzo
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h \
   src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibCompressor.h
cp src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib.h \
   src/org/apache/hadoop/io/compress/zlib/org_apache_hadoop_io_compress_zlib_ZlibDecompressor.h

In config.h, replace the line

#define HADOOP_LZO_LIBRARY libnotfound.so

with this one:

#define HADOOP_LZO_LIBRARY liblzo2.so
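
If you prefer, that one-line edit can be scripted, e.g. with GNU sed (run from the
same src/native directory where ./configure generated config.h):

# swap the placeholder library name for the real liblzo2 soname
sed -i 's/libnotfound\.so/liblzo2.so/' config.h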
make 


 3) BUILD NUTCH

The Nutch nightly trunk now comes with hadoop-0.12.jar, but you may want to swap in the latest nightly Hadoop build jar instead:


mv nutch/trunk/lib/hadoop-0.12.jar nutch/trunk/lib/hadoop-0.12.jar.ori
cp hadoop/trunk/build/hadoop-0.12.3-dev.jar nutch/trunk/lib/hadoop-0.12.jar
cd nutch/trunk
ant tar

 4) INSTALL

Copy and untar the generated .tar.gz file on every machine that will
participate in the search engine activities.
In my case I only have two identical machines available, called myhost1 and myhost2.

On each of them I have installed the Nutch binaries under /opt/nutch, while I have
decided to keep the Hadoop distributed filesystem in a directory called hadoopFs
located on a large disk mounted on /disk10.
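
One way to push the build to a node is sketched below (the tarball name produced by
"ant tar" and the /tmp staging path are assumptions; adapt them to your build):

# copy the generated tarball to a node and unpack it under /opt
scp nutch/trunk/build/nutch-*.tar.gz myhost2:/tmp/
ssh myhost2 'tar xzf /tmp/nutch-*.tar.gz -C /opt'
# then rename or symlink the extracted directory to /opt/nutch, and repeat for myhost1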



On both machines create the directory:
mkdir /disk10/hadoopFs/ 


Copy the Hadoop 64-bit native libraries, if needed:

mkdir /opt/nutch/lib/native/Linux-x86_64
cp -fl hadoop/trunk/src/native/lib/.libs/* /opt/nutch/lib/native/Linux-x86_64

 5) CONFIG

I will use myhost1 as the master machine, running the namenode and
jobtracker; it will also run a datanode and a tasktracker.
myhost2 will only run a datanode and a tasktracker.

A) On both machines edit the conf/hadoop-site.xml configuration file. Here are the values I have used:


   fs.default.name     : myhost1.mydomain.org:9010
   mapred.job.tracker  : myhost1.mydomain.org:9011
   mapred.map.tasks    : 40
   mapred.reduce.tasks : 3
   dfs.name.dir        : /disk10/hadoopFs/name
   dfs.data.dir        : /disk10/hadoopFs/data
   mapred.system.dir   : /disk10/hadoopFs/mapreduce/system
   mapred.local.dir    : /disk10/hadoopFs/mapreduce/local
   dfs.replication     : 2

   The mapred.map.tasks property tells how many map tasks you want to run in
parallel. This should be a multiple of the number of computers that you have;
in our case we are starting out with 2 computers.

   The dfs.replication property states how many servers a single file should be
   replicated to before it becomes available. Because we are using 2 servers, I have
   set this to 2.
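
   For reference, hadoop-site.xml uses Hadoop's XML property format; a minimal sketch
   with a few of the values above (the remaining properties follow the same
   name/value pattern):

<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>myhost1.mydomain.org:9010</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>myhost1.mydomain.org:9011</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <!-- dfs.name.dir, dfs.data.dir and the mapred.* entries follow the same pattern -->
</configuration>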


   You may also want to change nutch-site.xml by adding http.redirect.max with a
value different from the default of 3:

   http.redirect.max   : 10
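
   In nutch-site.xml this is expressed with the same XML property format, for example:

<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.redirect.max</name>
    <value>10</value>
  </property>
</configuration>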

 
B) Be sure that your conf/slaves file contains the names of the slave machines. In my case:


   myhost1.mydomain.org
   myhost2.mydomain.org

C) Create directories for PID and log files on both machines:

   mkdir /opt/nutch/pids
   mkdir