Grant Ingersoll wrote:
Anyone have any sample code or a demo of running the clustering over a large collection of documents that they could share? Mainly looking for an example of taking some corpus, converting it into the appropriate Mahout representation, and then running either k-means or canopy clustering on it.
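(For anyone following along: assuming the documents have already been vectorized into term-frequency vectors, the k-means step itself looks roughly like the sketch below. This is a minimal plain-Java illustration of the algorithm, not Mahout's actual driver API, and the class and method names are made up for the example.)

```java
import java.util.Arrays;

// Minimal, self-contained k-means sketch (NOT Mahout's API).
// Documents are assumed pre-converted to dense numeric vectors.
public class KMeansSketch {

    // Squared Euclidean distance between two vectors.
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return s;
    }

    // Returns the cluster assignment of each point after maxIter
    // rounds of the classic assign/update loop. Initial centroids
    // are supplied by the caller (e.g. from a canopy pass).
    static int[] cluster(double[][] points, double[][] centroids, int maxIter) {
        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: nearest centroid for each point.
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (dist2(points[p], centroids[c]) < dist2(points[p], centroids[best])) {
                        best = c;
                    }
                }
                assign[p] = best;
            }
            // Update step: recompute each centroid as the mean of its members.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[points[0].length];
                int n = 0;
                for (int p = 0; p < points.length; p++) {
                    if (assign[p] == c) {
                        n++;
                        for (int i = 0; i < sum.length; i++) sum[i] += points[p][i];
                    }
                }
                if (n > 0) {
                    for (int i = 0; i < sum.length; i++) centroids[c][i] = sum[i] / n;
                }
            }
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centroids = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(points, centroids, 5)));
    }
}
```

Mahout parallelizes both steps as MapReduce jobs, but the per-point logic is the same.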

Thanks,
Grant


I've been experimenting with Hadoop deployments on EC2 and have managed to deploy a single-node cluster using an AMI I built from the latest trunk version (0.18.0). I'm waiting for 0.17.0 to be released, since it has much nicer DNS support for deploying EC2 clusters than 0.16.x. At that point there should be a public 0.17.0 AMI that we can all use. I could probably hack the scripts to make mine work, but that is a little out of my comfort zone and 0.17.0 is imminent.

If we can identify some datasets that can be easily downloaded, I will put copies in S3 so they can be copied into the cloud once that is ready. I've run canopy over some Apache logs in my previous life, but the kinds of datasets under discussion sound much more interesting.
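(For context, the canopy pass itself is cheap. The sketch below is a rough sequential illustration of the canopy idea, not Mahout's MapReduce implementation, and the names are invented for the example: with thresholds T1 > T2, each remaining point can seed a canopy, and points within T2 of a center are consumed so they cannot seed another.)

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

// Rough sequential sketch of canopy clustering (NOT Mahout's
// MapReduce implementation). T1 > T2; the cheap distance metric
// here is plain Euclidean for illustration.
public class CanopySketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            s += d * d;
        }
        return Math.sqrt(s);
    }

    // Returns the canopy centers formed over the input points.
    static List<double[]> canopyCenters(List<double[]> points, double t1, double t2) {
        List<double[]> remaining = new LinkedList<>(points);
        List<double[]> centers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            centers.add(center);
            // Points within T2 are strongly bound to this canopy and are
            // removed; points between T2 and T1 would also join the canopy
            // but remain available to seed other canopies.
            remaining.removeIf(p -> dist(p, center) < t2);
        }
        return centers;
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>(Arrays.asList(
            new double[]{0, 0}, new double[]{0.5, 0}, new double[]{10, 10}));
        System.out.println(canopyCenters(pts, 3.0, 1.0).size()); // prints 2
    }
}
```

The resulting centers make good seed centroids for a subsequent k-means run, which is how the two are typically chained.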

Jeff

