The Kaggle data is not in libsvm format so you'd have to do some transformation.
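
As a rough illustration of the kind of transformation involved, here is a minimal Python sketch. It assumes the data is a CSV file with a header row, a single label column, and purely numeric features; none of that is guaranteed for any particular competition, and categorical columns would need encoding first. The function name `csv_to_libsvm` and the column names in the usage note are hypothetical.

```python
import csv
import io


def csv_to_libsvm(csv_text, label_column):
    """Convert CSV text (with a header row) into a list of LIBSVM lines.

    Assumption: every non-label column is numeric. Empty cells are
    treated as 0 and zero values are omitted, since LIBSVM is sparse.
    """
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    label_idx = header.index(label_column)

    lines = []
    for row in reader:
        parts = [row[label_idx]]
        feature_index = 1  # LIBSVM feature indices are 1-based and ascending
        for i, value in enumerate(row):
            if i == label_idx:
                continue
            v = float(value) if value else 0.0
            if v != 0.0:
                parts.append(f"{feature_index}:{v}")
            feature_index += 1
        lines.append(" ".join(parts))
    return lines
```

For example, `csv_to_libsvm("label,f1,f2,f3\n1,0.5,0,2\n", "label")` yields one line with the label followed by `index:value` pairs for the non-zero features. For files in the multi-GB range you would want to stream rows to an output file rather than build a list in memory.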


The Criteo and KDD Cup datasets are, if I recall, fairly large. The Criteo ad
prediction data is around 2-3GB compressed, I think.

To my knowledge, these are the largest binary classification datasets I've come
across that are easily and publicly available (though I'd be very happy to be
proved wrong about this).

On Thu, Jul 3, 2014 at 4:39 PM, AlexanderRiggers
<alexander.rigg...@gmail.com> wrote:

> Nick Pentreath wrote
>> Take a look at Kaggle competition datasets
>> - https://www.kaggle.com/competitions
> I was looking for files in LIBSVM format and never found anything larger on
> Kaggle. Most competitions I've seen need data processing and feature
> generation, but maybe I have to take a second look.
> Nick Pentreath wrote
>> For graph stuff the SNAP has large network
>> data: https://snap.stanford.edu/data/
> Thanks
