Drew, Thanks for your suggestions.
I have modified the script to enable a non-interactive mode by calling the script with the parameter "-ni". This will just run the Canopy clustering. I am not sure if it should run all the clustering algorithms; any thoughts? By default the script will be in interactive mode, so when users invoke the script they'll interact by choosing the clustering algorithm. Let me know if this is OK, and I'll test the script and upload it tonight.

Regarding the check on Hadoop, I was thinking of "hadoop fs -ls" but wasn't really sure. Anyhow, I am currently going with this, but if there is a better way we can adopt that later, I guess.

Also, shall I change build-reuters.sh for the interactive / non-interactive mode?

regards,
Joe

On Mon, Oct 11, 2010 at 9:14 AM, Drew Farris <[email protected]> wrote:
> On Sun, Oct 10, 2010 at 11:36 PM, Joe Kumar <[email protected]> wrote:
> > Drew / all,
> >
> > I have written a script (80% done) for running the clustering job on
> > synthetic control data.
> > Should I upload this in MAHOUT-520 or should I open a new JIRA issue?
>
> Great! I've revised MAHOUT-520's description to accommodate this, so
> why don't you go ahead and attach the script/patch there.
>
> > I'm thinking of modifying build-reuters.sh to make it more interactive.
> > Currently it says "uncomment lines for kmeans or lda", but we can ask the
> > user to select whether they want to run kmeans or lda and invoke the
> > command for those algorithms accordingly. I have done something similar
> > for the synthetic control data example.
>
> Interactivity is *ok*, but it would be excellent if the script did not
> *require* interactivity, e.g. was able to run automatically with
> command-line arguments. That way it could be run as part of the
> nightly build in Hudson.
>
> > When we are running some of the examples, we are checking if HADOOP_HOME
> > is set. Sometimes HADOOP_HOME might be set but if Hadoop is not running,
> > then our examples would fail.
> > So I am trying to see what would be the best way to check and make sure
> > Hadoop is up through the shell script. Once I get this, the script for
> > synthetic control data should be complete.
>
> I'm not aware of best practices here, but 'hadoop dfs -ls' can be used
> to check that the namenode is available, and 'hadoop job -list' can be
> used to check if the jobtracker is available. Each of these will retry
> up to 10 times to contact the namenode or jobtracker, so there will
> be a bit of a pause before an error if the service in question isn't
> available.
>
> Not sure of the best way to obtain datanode/tasktracker health.
>
> Drew
>
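A rough sketch of the two ideas discussed above (the "-ni" flag and the Hadoop liveness check) might look like the following. This is not the actual MAHOUT-520 patch; the function names `parse_mode` and `check_hadoop_up` are placeholders I made up, and the script assumes HADOOP_HOME points at a standard Hadoop install with `bin/hadoop` present.

```shell
#!/bin/bash
# Sketch only -- illustrates the -ni flag and a Hadoop availability
# probe; function names here are hypothetical, not from the real patch.

# Decide the run mode from the first argument: "-ni" selects the
# non-interactive mode that just runs Canopy; anything else (or no
# argument) leaves the script in interactive mode.
parse_mode() {
  if [ "$1" = "-ni" ]; then
    echo "non-interactive"
  else
    echo "interactive"
  fi
}

# Probe whether Hadoop is actually up, not merely installed.
# 'hadoop fs -ls /' talks to the namenode and 'hadoop job -list' talks
# to the jobtracker; each retries before giving up, so expect a pause
# when the service is down.
check_hadoop_up() {
  if [ -z "$HADOOP_HOME" ]; then
    echo "HADOOP_HOME is not set" >&2
    return 1
  fi
  "$HADOOP_HOME/bin/hadoop" fs -ls / >/dev/null 2>&1 || return 1
  "$HADOOP_HOME/bin/hadoop" job -list >/dev/null 2>&1 || return 1
  return 0
}

# Example wiring (commented out in this sketch):
#   MODE=$(parse_mode "$1")
#   check_hadoop_up || { echo "Hadoop is not running" >&2; exit 1; }
#   if [ "$MODE" = "non-interactive" ]; then run_canopy; fi
```

As Drew notes, these probes only confirm the namenode and jobtracker are reachable; datanode and tasktracker health would need a separate check.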
