[Nutch Wiki] Update of "GettingNutchRunningWithUbuntu" by EarlCahill

Apache Wiki Tue, 04 Oct 2005 10:27:12 -0700

The following page has been changed by EarlCahill:
http://wiki.apache.org/nutch/GettingNutchRunningWithUbuntu

New page:
Recently, and with a bit of effort, I got db1.spack up and running on nutch 
trunk. I decided to keep track of what I did to get db2.spack up and running, 
and contribute this tutorial.

== Install Ubuntu ==

Here are some minimal steps:

    * got "The Hoary Hedgehog" from http://www.ubuntu.com/download/
    * entered 'server' on the install screen
    * the rest, I thought, was a breeze
    * I did run 'sudo passwd', which allowed me to do stuff as root, as below

Just a little plug for Ubuntu. I guess I have a funny setup: I built an Athlon 
3200+ machine with onboard SATA drives that I wanted to RAID, and I wanted to 
run Java. Getting those few things working together took me a couple of months, 
off and on. Once I found Ubuntu, it took about a night, and Java took another 
day or two. Ubuntu was pretty well exactly what I was looking for: a 
stripped-down Debian that installs almost nothing by default and lets me 
apt-get install about whatever I want if the need arises. It could probably 
install ssh by default, though.

As a side note, I just spent about five minutes trying these steps on a rather 
old box running debian, and it didn't immediately work, though I will try again 
another day.

== Add Nutch User ==

Let's add a nutch user to do our nutch stuff

{{{
# adduser nutch
}}}

== java ==

I tried to get Java from the normal apt sources without success, and I am 
guessing my Athlon was the problem. I gave up and got Java from Sun 
(http://java.sun.com/j2se/1.5.0/download.jsp), using the "Download JDK 5.0 
Update 4" link. I tried 1.4.2 and it didn't work, but 1.5.0 did.

{{{
[EMAIL PROTECTED]:/opt# ./jdk-1_5_0_04-linux-amd64.bin
}}}

Let's put JAVA_HOME in ~/.bash_profile and source that file, for both root and 
nutch

{{{
# echo 'export JAVA_HOME=/opt/jdk1.5.0_04' >> ~/.bash_profile
# . ~/.bash_profile
[EMAIL PROTECTED]:~$ echo 'export JAVA_HOME=/opt/jdk1.5.0_04' >> ~/.bash_profile
[EMAIL PROTECTED]:~$ . ~/.bash_profile
}}}
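
One thing to watch for: running the echo lines above more than once appends a duplicate export each time. Here is a minimal sketch of a guard against that, assuming a POSIX shell with grep. The add_export helper and the throwaway demo file are made up for illustration and are not part of my original setup.

```shell
# add_export is a hypothetical helper, not part of Nutch or Ubuntu.
# It appends a line to a profile file only if that exact line is missing.
add_export() {
  profile=$1
  line=$2
  grep -qxF -- "$line" "$profile" 2>/dev/null || printf '%s\n' "$line" >> "$profile"
}

# Demo against a throwaway file rather than the real ~/.bash_profile.
demo_profile=$(mktemp)
add_export "$demo_profile" 'export JAVA_HOME=/opt/jdk1.5.0_04'
add_export "$demo_profile" 'export JAVA_HOME=/opt/jdk1.5.0_04'

# Count how many JAVA_HOME lines ended up in the file.
count=$(grep -c 'JAVA_HOME' "$demo_profile")
echo "JAVA_HOME lines: $count"
```

For the real thing, you would point the helper at ~/.bash_profile once as root and once as nutch.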

== apt ==

I changed my /etc/apt/sources.list to include

{{{
deb http://ubuntu-backports.mirrormax.net/ hoary-backports main universe multiverse restricted
deb http://ubuntu-backports.mirrormax.net/ hoary-extras main universe multiverse restricted

deb http://us.archive.ubuntu.com/ubuntu hoary main restricted
deb-src http://us.archive.ubuntu.com/ubuntu hoary main restricted

deb http://us.archive.ubuntu.com/ubuntu hoary-updates main restricted
deb-src http://us.archive.ubuntu.com/ubuntu hoary-updates main restricted

deb http://us.archive.ubuntu.com/ubuntu hoary universe
deb-src http://us.archive.ubuntu.com/ubuntu hoary universe

deb http://security.ubuntu.com/ubuntu hoary-security main restricted
deb-src http://security.ubuntu.com/ubuntu hoary-security main restricted
}}}
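
Each entry in sources.list has to sit on a single line, so a wrapped entry (easy to get when pasting from email) will confuse apt. A rough sanity check, assuming a POSIX shell with grep -E, could look like this. The check_sources function and the sample file are hypothetical, and the check only looks at how each line starts.

```shell
# check_sources is a hypothetical helper: it prints every line that is not
# blank, not a comment, and does not start with "deb" or "deb-src".
check_sources() {
  grep -vE '^[[:space:]]*(#|$)' "$1" | grep -vE '^[[:space:]]*deb(-src)?[[:space:]]'
}

# Throwaway sample with one deliberately wrapped entry.
sample=$(mktemp)
cat > "$sample" <<'EOF'
# main archive
deb http://us.archive.ubuntu.com/ubuntu hoary main restricted
deb http://us.archive.ubuntu.com/ubuntu hoary universe
multiverse restricted
EOF

# grep exits non-zero when nothing is left over, hence the "|| true".
bad=$(check_sources "$sample" || true)
echo "suspect lines: $bad"
```

Running check_sources on /etc/apt/sources.list and getting no output suggests every entry at least starts the way apt expects.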

With the new apt sources, let's update

{{{
# apt-get update
}}}

And get the packages we need.

{{{
# apt-get install ssh subversion ant lynx
}}}

ssh is just good to have, subversion is used to get nutch, ant is used to build 
nutch and lynx is used to test nutch.

== Build Nutch Code and Index ==

Let's change over to the nutch user

{{{
# su - nutch
}}}

Check out the code

{{{
[EMAIL PROTECTED]:~$ svn checkout http://svn.apache.org/repos/asf/lucene/nutch/
}}}

Since this tutorial is for getting trunk to work, let's go there

{{{
[EMAIL PROTECTED]:~ $ cd ~/nutch/trunk/
}}}

We build with ant

{{{
[EMAIL PROTECTED]:~/nutch/trunk $ ant
}}}

And build a war for tomcat and later searching

{{{
[EMAIL PROTECTED]:~/nutch/trunk $ ant war
}}}

Follow the nutch tutorial (http://lucene.apache.org/nutch/tutorial.html) to 
build an index, or for a simple index:

{{{
[EMAIL PROTECTED]:~/nutch/trunk $ echo 'http://lucene.apache.org/nutch/' > urls
[EMAIL PROTECTED]:~/nutch/trunk $ perl -pi -e 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' \
  conf/crawl-urlfilter.txt
[EMAIL PROTECTED]:~/nutch/trunk $ bin/nutch crawl urls -dir crawl.test -depth 3
}}}

See, perl can be useful :)
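
If you want to see what that substitution does before editing the real file, here is a small preview sketch. It uses sed instead of perl only so that nothing is modified in place. The sample filter line is my assumption of what the stock entry in conf/crawl-urlfilter.txt looks like, so check your own copy.

```shell
# Throwaway file standing in for conf/crawl-urlfilter.txt.
# The line below is an assumed example of the stock filter entry.
sample=$(mktemp)
printf '%s\n' '+^http://([a-z0-9]*\.)*MY.DOMAIN.NAME/' > "$sample"

# Same substitution as the perl one-liner, printed instead of edited in place.
result=$(sed 's|MY.DOMAIN.NAME|lucene.apache.org/nutch|' "$sample")
echo "$result"
```

The point of the edit is that the crawl only accepts URLs matching this filter, so the placeholder has to become the site you actually want crawled.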

== tomcat ==

Again, I tried apt without much luck, so I downloaded Tomcat 4.1.31 from Apache 
(http://jakarta.apache.org/site/downloads/downloads_tomcat-4.cgi).

As above, I put the java stuff in /opt

{{{
[EMAIL PROTECTED]:/opt# tar -xzvf jakarta-tomcat-4.1.31.tar.gz
}}}

Out with the old and in with the new

{{{
# rm -rf /opt/jakarta-tomcat-4.1.31/webapps/ROOT*
# cp ~nutch/nutch/trunk/build/nutch-0.8-dev.war \
    /opt/jakarta-tomcat-4.1.31/webapps/ROOT.war
}}}
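
The war name above is tied to the version trunk happened to build for me, so it may differ on your checkout. A hedged sketch of locating the war by pattern instead of hard-coding it, assuming the build leaves a nutch-*.war in the build directory. The find_war helper and the demo directory are hypothetical.

```shell
# find_war is a hypothetical helper: it prints the first nutch-*.war
# found in the given directory, or complains and returns non-zero.
find_war() {
  for f in "$1"/nutch-*.war; do
    if [ -f "$f" ]; then
      printf '%s\n' "$f"
      return 0
    fi
  done
  echo "no nutch war found in $1" >&2
  return 1
}

# Demo against a throwaway directory with an empty stand-in war file.
demo=$(mktemp -d)
mkdir -p "$demo/build"
: > "$demo/build/nutch-0.8-dev.war"

war=$(find_war "$demo/build")
echo "would copy: $war"
```

With a real checkout you would pass ~nutch/nutch/trunk/build and copy the result to webapps/ROOT.war.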

Let's move to where we put the index

{{{
# cd ~nutch/nutch/trunk/crawl.test
}}}

And start tomcat from there

{{{
# /opt/jakarta-tomcat-4.1.31/bin/catalina.sh start
}}}

== Test ==

Connect to tomcat and perform a search.

{{{
$ lynx localhost:8080
}}}

I searched for 'nutch' and all was well! (you can use <TAB> to get to the 
search input in lynx)

Tutorial written by Earl Cahill, 2005
