Hi Thomas (and Debian developers),

My responses inline below:

On Wed, Dec 30, 2009 at 10:53 AM, Thomas Koch <[email protected]> wrote:

> Hi,
>
> today I tried to run the Cloudera Debian dist on a 4-machine cluster. I
> still have some itches, see my list below. Some of them may require a fix
> in the packaging. Therefore I thought that it may be time to start an
> official Debian package of hadoop with a public Git repository so that
> everybody can participate. Would Cloudera support this? I'd package
> hadoop 0.20 and apply all the Cloudera patches (managed with topgit[1]).
>

Our distribution and all of our patches are licensed Apache 2.0, so we're of
course happy to have the community work on packaging with us. Our goals are
slightly different from the Debian project's, which is why we run our own
repository separate from the Debian/Ubuntu distributions. One main
difference is that we have an internal build process that ensures parity
between the RPM and Debian packages with little manual work; most existing
packaging solutions we found were either Debian- or RPM-specific. A second
difference is that we believe Hadoop is most stable when the exact versions
of the Java libraries it depends on are bundled inside Hadoop's own
directory tree, which runs contrary to the Debian Java packaging policy.

That said, we're certainly happy to help the community with questions about
our packages or patch set.


> At this point I'd like to have your opinion whether it would be wise to
> have
> versioned binary packages like hadoop-18, hadoop-20 or just plain hadoop
> for
> the Debian package?
>
>
We originally had a single "hadoop" package and later moved to versioned
packages due to community demand. Some users prefer to stay on older
versions of the software, since a cluster-wide major version upgrade can
involve API changes and requires a fairly lengthy downtime window for the
metadata upgrade process. Rather than point those users at separate
repositories, we elected to package the two versions in parallel. This also
lets client machines that speak to multiple separate clusters within an
enterprise have both client versions installed.
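For example, on a client box with our repository configured, something like
the following works; the versioned package names here reflect our current
naming and are only illustrative:

    # Each version installs into its own directory tree (e.g. /usr/lib/hadoop-0.20),
    # so the two sets of client jars and scripts do not collide.
    sudo apt-get install hadoop-0.18 hadoop-0.20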

> My issues so far:
>
> start-dfs.sh started only the local namenode, not the secondary NN nor the
> datanodes without indicating any error. masters and slaves were configured
> correctly.
>
>
We generally do not advise users of the Debian packages to use start-*.sh.
Instead, we recommend simply using the init scripts on each machine in the
cluster, typically started automatically at boot or kept running by a
configuration management system such as Puppet or Chef.
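With our current packages that looks roughly like this on each node (the
init script names depend on which daemon packages you installed, so treat
these as illustrative):

    # On the NN host:
    sudo /etc/init.d/hadoop-0.20-namenode start
    # On each DN host:
    sudo /etc/init.d/hadoop-0.20-datanode start
    # On the 2NN host:
    sudo /etc/init.d/hadoop-0.20-secondarynamenode start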

That said, there's no reason those scripts shouldn't work. Did you run them
as the hadoop user, and make sure the hadoop user can ssh to every node in
the cluster without a password prompt (using either a passphrase-less SSH
key or ssh-agent)?
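A quick sanity check for the second part, run as the hadoop user on the box
where you invoke start-dfs.sh (adjust the path if your slaves file lives
elsewhere):

    for host in $(cat /etc/hadoop-0.20/conf/slaves); do
      ssh -o BatchMode=yes "$host" true || echo "no passwordless ssh to $host"
    done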


> When starting the datanodes manually they were recognized and HDFS worked.
> However the web UIs at port 50075 show only a directory view with a
> WEB-INF/ directory in it. This is most likely a packaging/configuration
> issue. The same with the SNN on port 50090.
>
>
That's expected: the DN and 2NN in currently released versions of Hadoop do
not have web UIs of their own; those ports exist mainly to serve internal
servlets. Port 50070 should be the NN web UI, which properly shows cluster
statistics and lets you browse the DFS.
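A quick way to verify from a shell (the hostname below is just a
placeholder; dfshealth.jsp is the NN front page in the 0.20 series, if I
recall the path correctly):

    # Replace namenode.example.com with your NN host:
    curl -s http://namenode.example.com:50070/dfshealth.jsp | head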


> The SNN shows:
>
> java.io.FileNotFoundException: http://192.168.122.166:50070/getimage?putimage=1&port=50090&machine=127.0.1.1&token=-18:737152035:0:1262195990000:1262194649873
>         at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1288
>
>
This sounds like HDFS-62:  https://issues.apache.org/jira/browse/HDFS-62

The workaround can be found here:
http://www.cloudera.com/blog/2009/02/10/multi-host-secondarynamenode-configuration/
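The short version (the post has the full details) is to point the 2NN at the
NN's HTTP server explicitly, by adding something like this inside the
<configuration> element of the hdfs-site.xml that the secondary reads, using
your NN address from the traceback above:

    <property>
      <name>dfs.http.address</name>
      <value>192.168.122.166:50070</value>
    </property>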

> I did not get MapReduce working yet. It seems that it would be very
> helpful to have some more verbose example configuration files. These
> should have commented properties for the most important settings.
>
>
The full set of configuration defaults can be found in
src/hdfs/hdfs-default.xml, src/mapred/mapred-default.xml, and
src/core/core-default.xml.

We've also included some basic example configurations in the example-confs/
directory of our distribution.
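For instance, a quick way to see what a given property defaults to, run
from the top of the source tree:

    # -A 2 shows the two lines following each match (typically the <value>
    # and the start of the <description>):
    grep -A 2 'dfs.http.address' src/hdfs/hdfs-default.xml
    grep -A 2 'mapred.job.tracker' src/mapred/mapred-default.xml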

Thanks, and happy new year to you as well,

-Todd
