On 17 May 2008 at 21:01, Isabel Drost wrote:
On Saturday 17 May 2008, Lukas Vlcek wrote:
http://archive.ics.uci.edu/ml/datasets.html?format=&task=clu&att=&area=&numAtt=&numIns=&type=&sort=taskUp&view=table

Some of those data sets are reasonably small so that they could be
integrated into Mahout unit tests by default (sounds like a crazy idea?).

Hmm. If we want to integrate them into unit tests, we should look at the licenses of these datasets. But for examples, it might be OK if users simply download the dataset from the UCI web page themselves.

+1

Actually, the Taste code already contains an example that depends on a data set from GroupLens that the user must download. For examples I don't mind at all, especially if the data is good. For unit tests I really think we want data that we can redistribute ourselves.

Once again I'm taking the opportunity to point out that the synthetic data generator at http://www.datasetgenerator.com/ is excellent for unit and load testing. The generator's C source code has been donated to Mahout by Gabor Melli (thanks again) and is available in the issue tracker!


      karl
