On Sun, Oct 10, 2010 at 11:36 PM, Joe Kumar <[email protected]> wrote:
> Drew / all,
>
> I have written a script (80% done) for running the clustering job on
> synthetic control data.
> Should I upload this to MAHOUT-520, or should I open a new JIRA issue?

Great! I've revised MAHOUT-520's description to accommodate this, so
why don't you go ahead and attach the script/patch there.

> I'm thinking of modifying build-reuters.sh to make it more interactive.
> Currently it says "uncomment lines for kmeans or lda", but we could ask the
> user to select whether they want to run kmeans or lda and invoke the command
> for that algorithm accordingly. I have done something similar for the
> synthetic control data example.

Interactivity is ok, but it would be excellent if the script did not
*require* interactivity, e.g. if it could run unattended, perhaps driven
by command-line arguments. That way the examples could be run as part of
the nightly build in Hudson.
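
As a rough sketch of that idea (the function name choose_algorithm and
the kmeans default are illustrative assumptions, not taken from
build-reuters.sh), the script could take the algorithm from its first
argument and prompt only when no argument is given and stdin is a
terminal:

```shell
#!/bin/sh
# Hypothetical helper: pick the algorithm from an argument, prompting only
# when nothing was passed and a terminal is attached to stdin.
choose_algorithm() {
  algo="$1"
  if [ -z "$algo" ]; then
    if [ -t 0 ]; then
      # Interactive run: ask on stderr so stdout carries only the answer.
      printf 'Run kmeans or lda? ' >&2
      read -r algo
    else
      # Unattended run (e.g. a nightly build): assumed default.
      algo="kmeans"
    fi
  fi
  case "$algo" in
    kmeans|lda) echo "$algo" ;;
    *) echo "unknown algorithm: $algo (expected kmeans or lda)" >&2
       return 1 ;;
  esac
}
```

A script would then start with something like
ALGO=$(choose_algorithm "$1") || exit 1, so a nightly job can pass
"kmeans" or "lda" explicitly and never block on a prompt.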

> When we run some of the examples, we check whether HADOOP_HOME is
> set. Sometimes HADOOP_HOME might be set but, if Hadoop is not running,
> our examples would fail. So I am trying to find the best way to
> check that Hadoop is up from a shell script. Once I have this, the
> script for synthetic control data should be complete.

I'm not aware of best practices here, but 'hadoop dfs -ls' can be used
to check that the namenode is available, and 'hadoop job -list' can be
used to check that the jobtracker is available. Each of these will retry
up to 10 times to contact the namenode or jobtracker, so there will
be a bit of a pause before an error if the service in question isn't
available.
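
A minimal sketch of such a check, assuming the hadoop launcher lives in
$HADOOP_HOME/bin (or is on the PATH when HADOOP_HOME is unset) and using
'fs -ls /', the generic form of the listing command. The function names
check_service and check_hadoop are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: run a probe command quietly and report whether it
# succeeded. Usage: check_service <label> <command> [args...]
check_service() {
  name="$1"
  shift
  if "$@" >/dev/null 2>&1; then
    echo "$name: available"
  else
    echo "$name: NOT available" >&2
    return 1
  fi
}

# Probe the namenode with a filesystem listing and the jobtracker with a
# job listing. Each probe may pause while the client retries the connection.
check_hadoop() {
  # Use $HADOOP_HOME/bin/hadoop when HADOOP_HOME is set, else plain "hadoop".
  hadoop_bin="${HADOOP_HOME:+$HADOOP_HOME/bin/}hadoop"
  check_service namenode "$hadoop_bin" fs -ls / &&
    check_service jobtracker "$hadoop_bin" job -list
}
```

An example script could call check_hadoop || exit 1 before submitting any
job. This only covers the namenode and jobtracker, not the health of
individual datanodes or tasktrackers.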

I'm not sure of the best way to obtain datanode/tasktracker health.

Drew
