Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread Nick Pentreath
Take a look at the Kaggle competition datasets: https://www.kaggle.com/competitions

For SVM, there are a couple of fairly large ad click prediction datasets.

For graph work, SNAP has large network datasets: https://snap.stanford.edu/data/




On Thu, Jul 3, 2014 at 3:25 PM, AlexanderRiggers
alexander.rigg...@gmail.com wrote:

 Hello!
 I want to play around with several different cluster settings and measure
 the performance of MLlib and GraphX, and was wondering if anybody here could
 point me to datasets of 5 GB or more for these applications.
 I am mostly interested in SVM and Triangle Count, but would be glad of any
 help.
 Best regards,
 Alex
 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/Sample-datasets-for-MLlib-and-Graphx-tp8760.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread AlexanderRiggers
Nick Pentreath wrote
 Take a look at Kaggle competition datasets
 - https://www.kaggle.com/competitions

I was looking for files in LIBSVM format and never found anything larger on
Kaggle. Most competitions I've seen need data processing and feature
generation, but maybe I have to take a second look.


Nick Pentreath wrote
 For graph stuff the SNAP has large network
 data: https://snap.stanford.edu/data/

Thanks






Re: Sample datasets for MLlib and Graphx

2014-07-03 Thread Nick Pentreath
The Kaggle data is not in LIBSVM format, so you'd have to do some transformation.
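
As a rough illustration of that kind of transformation, here is a minimal Python sketch that turns a numeric CSV with a header row into LIBSVM lines (`<label> <index>:<value> ...`, with 1-based indices and zero values omitted). The function names, the `label` column name and the sample data are made up for illustration; real Kaggle files would usually also need categorical columns encoded and missing values handled first.

```python
import csv
import io


def row_to_libsvm(label, features):
    """Format one example as a LIBSVM line: '<label> <index>:<value> ...'.

    Indices are 1-based and ascending; zero-valued features are omitted,
    which is what keeps the format sparse. Note that ':g' keeps only
    6 significant digits, which is an assumption made for readability.
    """
    pairs = [
        f"{index}:{value:g}"
        for index, value in enumerate(features, start=1)
        if value != 0
    ]
    return " ".join([f"{label:g}"] + pairs)


def csv_to_libsvm(csv_text, label_column="label"):
    """Convert CSV text with a header row into a list of LIBSVM lines.

    Assumes every column other than `label_column` is already numeric
    (hypothetical layout; adapt to the actual competition files).
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    feature_names = [name for name in reader.fieldnames if name != label_column]
    lines = []
    for row in reader:
        label = float(row[label_column])
        features = [float(row[name]) for name in feature_names]
        lines.append(row_to_libsvm(label, features))
    return lines


# Tiny made-up sample: one label column and three numeric features.
sample = "label,f1,f2,f3\n1,0.5,0,2\n0,0,0,1.25\n"
for line in csv_to_libsvm(sample):
    print(line)
```

Once the lines are written to a file, it can be read in MLlib with MLUtils.loadLibSVMFile. For files of several GB, the same per-row logic would be better applied while streaming over the file than by building a list in memory.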


The Criteo and KDD Cup datasets are, if I recall, fairly large. The Criteo ad
click prediction data is around 2-3 GB compressed, I think.

To my knowledge, these are the largest easily and publicly available binary
classification datasets I've come across. I'd be very happy to be proved wrong
about this, though :)