Grant Ingersoll wrote:
> Anyone have any sample code or demo of running the clustering over a
> large collection of documents that they could share? Mainly looking for
> an example of taking some corpus, converting it into the appropriate
> Mahout representation and then running either the k-means or the canopy
> clustering on it.
There is the rule based data set generation in MAHOUT-43.
http://www.datasetgenerator.com
Push a few buttons and you have an insane amount of OK test data
according to your specifications. That is what I have been using.
I also have a contact at a company that produces news article data for
indexing. The data is nicely organized, and they have previously offered
to look into committer access to it for local tests.
I also have a number of data sets whose ownership I'm not certain about.
For instance, I've been gathering real estate data for Sweden for some
time, as the sites I was using to find an apartment did not work the way
I wanted them to :)
karl