Grant Ingersoll wrote:
Anyone have any sample code or demo of running the clustering over a
large collection of documents that they could share? Mainly looking
for an example of taking some corpus, converting it into the
appropriate Mahout representation and then running either the k-means
or the canopy clustering on it.
Thanks,
Grant
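Not Mahout code, but for reference, the k-means iteration that such a job would distribute over Hadoop can be sketched as a toy in-memory Java example (class and method names here are made up for illustration; Mahout's actual drivers and vector representation differ):

```java
// Toy 1-D k-means (Lloyd's algorithm) -- illustrates the assign/recompute
// loop that a distributed k-means job parallelizes. NOT Mahout's API.
public class ToyKMeans {

    // One pass: assign each point to its nearest center, then recompute
    // each center as the mean of the points assigned to it.
    static double[] step(double[] points, double[] centers) {
        double[] sums = new double[centers.length];
        int[] counts = new int[centers.length];
        for (double p : points) {
            int best = 0;
            for (int c = 1; c < centers.length; c++) {
                if (Math.abs(p - centers[c]) < Math.abs(p - centers[best])) {
                    best = c;
                }
            }
            sums[best] += p;
            counts[best]++;
        }
        double[] next = new double[centers.length];
        for (int c = 0; c < centers.length; c++) {
            // Keep an empty cluster's center where it was.
            next[c] = counts[c] == 0 ? centers[c] : sums[c] / counts[c];
        }
        return next;
    }

    // Iterate a fixed number of passes (real jobs also test convergence).
    static double[] cluster(double[] points, double[] centers, int iterations) {
        for (int i = 0; i < iterations; i++) {
            centers = step(points, centers);
        }
        return centers;
    }
}
```

In the Hadoop version, the assignment step is the map phase and the mean recomputation is the reduce phase, with one such job per iteration.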
I've been experimenting with Hadoop deployments on EC2 and have managed to
deploy a single-node cluster using an AMI I built from the latest trunk
version (0.18.0). I'm waiting for 0.17.0 to be released, since it has
much nicer DNS support for deploying EC2 clusters than 0.16.x. At that
point there should be a public 0.17.0 AMI that we can all use. I could
probably hack the scripts to make mine work, but that is a little outside
my comfort zone and 0.17 is imminent.
If we can identify some datasets that can be easily downloaded, I will
put copies in S3 so that they can be easily copied into the cloud once
that is ready. I've run canopy over some Apache logs in my previous life,
but the kinds of datasets under discussion sound much more interesting.
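For anyone unfamiliar with canopy, the single-machine version of the idea is small enough to sketch in a few lines of Java (a toy on 1-D points with invented names, not the Mahout implementation, which runs this over Hadoop):

```java
import java.util.ArrayList;
import java.util.List;

// Toy canopy clustering on 1-D points. Canopy uses two distance
// thresholds T1 > T2: a point within T2 of an existing canopy center is
// consumed by that canopy and removed from further consideration; points
// within T1 would additionally be added to the canopy's membership (this
// sketch only tracks the centers). NOT Mahout's API.
public class ToyCanopy {

    static List<Double> canopyCenters(List<Double> points, double t1, double t2) {
        List<Double> remaining = new ArrayList<>(points);
        List<Double> centers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // The next unconsumed point seeds a new canopy.
            double center = remaining.remove(0);
            centers.add(center);
            // Points tightly bound to this canopy (within T2) are consumed.
            remaining.removeIf(p -> Math.abs(p - center) < t2);
        }
        return centers;
    }
}
```

The resulting canopy centers are a cheap way to seed k-means, which is why the two algorithms are often run as a pipeline.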
Jeff