On 17 May 2008 at 21:01, Isabel Drost wrote:
On Saturday 17 May 2008, Lukas Vlcek wrote:
http://archive.ics.uci.edu/ml/datasets.html?format=&task=clu&att=&area=&numAtt=&numIns=&type=&sort=taskUp&view=table
Some of those data sets are reasonably small, so they could be
integrated into Mahout unit tests by default (sounds like a crazy
idea?).
Hmm. If we want to integrate them in unit tests, we should have a
look at the license of these datasets. But for examples, it might be
OK if users simply download the dataset from the UCI web page
themselves.
+1
Actually, the Taste code already contains an example that depends on a
data set from GroupLens that the user must download. For examples I
don't mind at all, especially if the data is good. For unit tests I
really think we want data that we can redistribute ourselves.
Once again I'm taking the opportunity to point out that the synthetic
data generator at http://www.datasetgenerator.com/ is excellent for
unit and load testing. The generator's C source code has been donated
to Mahout by Gabor Melli (thanks again) and is available in the issue
tracker!
karl