[Nutch-general] FW: (Hadoop) Running WordCount in pseudo-distributed configuration

Jon Blower Tue, 28 Feb 2006 00:49:32 -0800

Dear all,

This is a copy of a conversation from the hadoop mailing list - I am
forwarding it to this list as I thought there may be interested parties on
this list that are not on the hadoop list.  Apologies for cross-posting.  It
describes a problem I had with jobs hanging in Hadoop and how I fixed it.
(Essentially, the problem was caused by the Jetty web server not running
properly.)

Regards, Jon

-----Original Message-----
From: Jon Blower [mailto:[EMAIL PROTECTED] 
Sent: 27 February 2006 21:20
To: [email protected]
Subject: RE: Running WordCount in pseudo-distributed configuration

Hi everyone,

After some investigation I've managed to fix my problem.  In short, the
problem is caused by the Jetty web server failing to run.  This is part of
the jobtracker component: if this isn't running then jobs will simply hang.

Jetty was not starting up properly because of two problems with the web.xml
file that is generated when building hadoop:
   1) web.xml is validated with a DTD that is supposed to be downloaded from
the web.  However, my server does not have a connection to the Internet.
   2) The contents of web.xml do not seem to be valid anyway (they cause
ClassNotFoundExceptions).

You need to make a few changes to ensure that the system sees a valid
web.xml and doesn't try to validate it with a DTD.  Here's how I did it.
Basically, I created the webapps directory manually instead of allowing it
to be built by the build script:

1) Edited build.xml:
   i) In the "init" target, removed the '<mkdir
dir="${build.webapps}/WEB-INF"/>' and the '<copy todir="${build.webapps}"
...' portions.  This stops the build file from automatically building the
build/webapps directory.
   ii) In the "compile" target, removed the "jsp-compile" portion.  This
stops the system from building a new web.xml.
   iii) In the "jar" target, removed the '<zipfileset dir="webapps"...'
portion.  This stops the webapps directory from appearing in the hadoop jar
file.

2) Created build/webapps and populated it:
   i) Moved src/webapps/index.html and src/webapps/mapred/*.jsp into
build/webapps.
   ii) Created a directory called WEB-INF in build/webapps
   iii) Inside WEB-INF, created a file called web.xml with the following
contents:

<?xml version="1.0" encoding="ISO-8859-1"?>
<web-app>
</web-app>

      The file doesn't need to contain any more information: the web
application is made up of JSP files that do not need to be deployed.
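
Step (2) can be sketched as a small shell function.  This is only an
illustration of the moves described above: the function name
populate_webapps is hypothetical, and it assumes its argument is the root of
a Hadoop source tree laid out as in this message.

```shell
#!/bin/sh
# Sketch of step (2): build the webapps directory by hand.
# populate_webapps is a hypothetical helper name, not part of Hadoop.
# $1 is assumed to be the Hadoop home directory.
populate_webapps() {
    hadoop_home="$1"
    webapps="$hadoop_home/build/webapps"

    # 2(ii): create build/webapps and its WEB-INF subdirectory
    mkdir -p "$webapps/WEB-INF" || return 1

    # 2(i): move the index page and the JSP files into build/webapps
    mv "$hadoop_home/src/webapps/index.html" "$webapps/" || return 1
    mv "$hadoop_home"/src/webapps/mapred/*.jsp "$webapps/" || return 1

    # 2(iii): write the minimal web.xml shown above
    cat > "$webapps/WEB-INF/web.xml" <<'EOF'
<?xml version="1.0" encoding="ISO-8859-1"?>
<web-app>
</web-app>
EOF
}
```

It would be called as, for example, 'populate_webapps "$PWD"' from the
hadoop home directory.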

3) Edited bin/hadoop:
   Added the argument "-Dorg.mortbay.xml.XmlParser.NotValidating=true" to
the final line in the file (the one that starts 'exec "$JAVA" ...').  This
stops Jetty from trying to validate web.xml with a DTD.
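
For anyone who prefers to script the edit in step (3), the following is a
rough sketch.  It assumes the final line of the script begins with exactly
'exec "$JAVA"', as described above; the function name add_jetty_flag and the
choice to put the flag directly after "$JAVA" (so that it precedes the class
name) are my own assumptions, not something taken from the Hadoop scripts.

```shell
#!/bin/sh
# Sketch of step (3): add the Jetty non-validating flag to the exec line.
# add_jetty_flag is a hypothetical helper name; $1 is the script to edit
# (bin/hadoop in this message).
add_jetty_flag() {
    script="$1"
    flag='-Dorg.mortbay.xml.XmlParser.NotValidating=true'

    # Do nothing if the flag is already there, so re-running is harmless.
    if grep -qF -- "$flag" "$script"; then
        return 0
    fi

    # Keep a backup, then rewrite in place via redirection so the
    # original file keeps its executable permission.
    cp -p "$script" "$script.bak" &&
    sed 's|^exec "\$JAVA"|exec "$JAVA" '"$flag"'|' "$script.bak" > "$script"
}
```

Usage would be 'add_jetty_flag bin/hadoop'; the untouched original is left
in bin/hadoop.bak.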

4) In src/java/org/apache/hadoop/mapred/, changed the source files so that
the TaskTrackerStatus, JobInProgress, JobProfile and JobStatus classes are
public, not package-private.  This is required for the JSP files to be able
to use these classes.

5) Ran "ant jar" to build the hadoop library and ran "cp
build/hadoop-0.1-dev.jar ." to copy the JAR into the hadoop home directory.
Deleted the existing hadoop-nightly.jar file.  This makes sure that the new
JAR file is picked up.

6) Edited conf/hadoop-site.xml to have the following contents:

<property>
  <name>fs.default.name</name>
  <value>localhost:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

7) Ran bin/start-all.sh.  Checked the log files to make sure no exceptions
were being thrown.  Note that the build/webapps directory that I created in
step (2) is automatically found and used as the base for the web
application.  It's important to follow step (1) above to prevent a
conflicting webapps directory from appearing in the JAR file.

8) In a web browser, opened http://localhost:50030.  This gave me a link to
the Job Tracker, which I could follow to monitor the jobs as I submitted
them.

9) Uploaded some input files (in a directory called "in") to the distributed
file system: "bin/hadoop dfs -put in in".

10) Ran a test job: "bin/hadoop org.apache.hadoop.examples.WordCount in
out".  Monitored the job's progress using the web interface.

11) Downloaded the output files: "bin/hadoop dfs -get out out".
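
The command-line parts of steps (7) and (9) to (11) can be strung together
as below.  This is a sketch only: run_wordcount is a hypothetical name, it
assumes the input directory "in" already exists in the hadoop home
directory, and it leaves out the manual checks (reading the logs, watching
the web interface).

```shell
#!/bin/sh
# Sketch of steps (7) and (9)-(11), run from the Hadoop home directory.
# run_wordcount is a hypothetical helper name; $1 is the Hadoop home.
# The body is a subshell, so the caller's working directory is unchanged,
# and the && chain stops at the first command that fails.
run_wordcount() (
    cd "$1" &&
    bin/start-all.sh &&                                        # step 7
    bin/hadoop dfs -put in in &&                               # step 9
    bin/hadoop org.apache.hadoop.examples.WordCount in out &&  # step 10
    bin/hadoop dfs -get out out                                # step 11
)
```

Called as 'run_wordcount /path/to/hadoop', it returns non-zero if any of the
four commands fails.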

Success!  I hope this is useful to others who have been struggling to get
started, and for the developers.

There is one more thing that I needed to do on my system (Red Hat 9),
because the version of ssh is older than is assumed by hadoop:

    In bin/slaves.sh, near the bottom, removed the option "-o
ConnectTimeout=1" from the call to ssh.  The ConnectTimeout option is not
understood by my version of ssh (OpenSSH 3.5p1).
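
The same change can be made non-interactively.  The sketch below assumes the
option appears in the script exactly as ' -o ConnectTimeout=1' (the spelling
quoted above); strip_connect_timeout is a hypothetical name.

```shell
#!/bin/sh
# Sketch: remove the ssh ConnectTimeout option from a script.
# strip_connect_timeout is a hypothetical helper name; $1 is the script
# to edit (bin/slaves.sh in this message).
strip_connect_timeout() {
    script="$1"
    # Keep a backup, then rewrite in place so file permissions survive.
    cp -p "$script" "$script.bak" &&
    sed 's/ -o ConnectTimeout=1//' "$script.bak" > "$script"
}
```

Usage would be 'strip_connect_timeout bin/slaves.sh'; the original is kept
as bin/slaves.sh.bak.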

Jon

> -----Original Message-----
> From: Jon Blower [mailto:[EMAIL PROTECTED]
> Sent: 27 February 2006 11:55
> To: Hadoop mailing list
> Subject: Running WordCount in pseudo-distributed configuration
> 
> Hi all,
> 
> I am having the same problem that Ramanan reported to this list (I 
> haven't seen a reply to Ramanan's question).  I am trying to run the 
> WordCount example in pseudo-distributed mode, i.e. everything running 
> on localhost with the following contents in hadoop-site.xml:
> 
>    <property>
>      <name>fs.default.name</name>
>      <value>localhost:9000</value>
>    </property>
>    <property>
>      <name>mapred.job.tracker</name>
>      <value>localhost:9001</value>
>    </property>
>    <property>
>      <name>dfs.replication</name>
>      <value>1</value>
>    </property>
> 
> I can run bin/start-all.sh without problems and I can upload and 
> download material to and from the DFS.  However, when I run
> 
>    bin/hadoop org.apache.hadoop.examples.WordCount in out
> 
> I get the following output:
> 
> 060227 114628 parsing file:/users/resc/programs/hadoop-nightly/conf/hadoop-default.xml
> 060227 114628 parsing file:/users/resc/programs/hadoop-nightly/conf/mapred-default.xml
> 060227 114628 parsing file:/users/resc/programs/hadoop-nightly/conf/hadoop-site.xml
> 060227 114628 Client connection to 127.0.0.1:9001: starting
> 060227 114628 Client connection to 127.0.0.1:9000: starting
> 060227 114628 parsing file:/users/resc/programs/hadoop-nightly/conf/hadoop-default.xml
> 060227 114628 parsing file:/users/resc/programs/hadoop-nightly/conf/hadoop-site.xml
> 060227 114629 Running job: job_17c13e
> 060227 114630  map 0%  reduce 0%
> 
> ... and the program just hangs.  The input directory "in" has been 
> uploaded to the DFS.  The WordCount program works fine in standalone 
> mode.
> 
> Does anyone know what's going wrong?  I am using the nightly build 
> from 27-02-2006 on Red Hat Linux 9.
> 
> There is suspicious activity in the log files.  The jobtracker log 
> contains a lot of exceptions, including:
> 
> 060227 114952 Web application not found /users/resc/programs/hadoop-nightly/file:/users/resc/programs/hadoop-nightly/hadoop-nightly.jar!/webapps
> 060227 114952 Configuration error on /users/resc/programs/hadoop-nightly/file:/users/resc/programs/hadoop-nightly/hadoop-nightly.jar!/webapps
> java.io.FileNotFoundException: /users/resc/programs/hadoop-nightly/file:/users/resc/programs/hadoop-nightly/hadoop-nightly.jar!/webapps
> 
> Also:
> 
> 060227 114953 Starting tracker
> java.io.IOException: Could not start HTTP server
>       at org.apache.hadoop.mapred.JobTrackerInfoServer.start(JobTrackerInfoServer.java:104)
>       at org.apache.hadoop.mapred.JobTracker.<init>(JobTracker.java:304)
>       at org.apache.hadoop.mapred.JobTracker.startTracker(JobTracker.java:50)
>       at org.apache.hadoop.mapred.JobTracker.main(JobTracker.java:820)
> 060227 114954 parsing file:/users/resc/programs/hadoop-nightly/conf/hadoop-default.xml
> 060227 114954 parsing file:/users/resc/programs/hadoop-nightly/conf/mapred-default.xml
> 060227 114954 parsing file:/users/resc/programs/hadoop-nightly/conf/hadoop-site.xml
> 060227 114954 Starting tracker
> java.net.BindException: Address already in use
>       at java.net.PlainSocketImpl.socketBind(Native Method)
> 
> There are a large number of BindExceptions, all of which follow a 
> "Starting tracker" message.  This happens even after a single invocation 
> of start-all.sh.  I don't understand why it's apparently trying to start 
> the tracker so many times.
> 
> Thanks in advance,
> Jon
> 
> 
> --------------------------------------------------------------
> Dr Jon Blower              Tel: +44 118 378 5213 (direct line)
> Technical Director         Tel: +44 118 378 8741 (ESSC)
> Reading e-Science Centre   Fax: +44 118 378 6413
> ESSC                       Email: [EMAIL PROTECTED]
> University of Reading
> 3 Earley Gate
> Reading RG6 6AL, UK
> --------------------------------------------------------------
> 
> 
