I recommend using the data generators provided with MLlib to generate synthetic data for your scalability tests, provided they're well suited to your algorithms. They let you control things like the number of examples, the dimensionality of your dataset, and the number of partitions.
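For example, here's a minimal sketch using LinearDataGenerator from org.apache.spark.mllib.util (the parameter values and the output path are just placeholders; KMeansDataGenerator and friends follow the same pattern):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.util.LinearDataGenerator

    object GenTestData {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("GenTestData"))

        // 1M labeled examples, 100 features, label noise eps = 0.1,
        // spread over 64 partitions; vary these for your scaling runs.
        val data = LinearDataGenerator.generateLinearRDD(
          sc, nexamples = 1000000, nfeatures = 100, eps = 0.1, nparts = 64)

        data.saveAsTextFile("hdfs:///tmp/linear-test-data")
        sc.stop()
      }
    }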
As far as cluster setup goes, I usually launch spot instances with the spark-ec2 scripts, and then check out a repo which contains a simple driver application for my code. Then I have something crude like bash scripts running my program and collecting output (see the sketch at the end of this message). You could have a look at the spark-perf repo if you want something a little more principled/automatic.

- Evan

> On Oct 2, 2014, at 5:37 PM, Yu Ishikawa <yuu.ishikawa+sp...@gmail.com> wrote:
>
> Hi all,
>
> I am trying to contribute some machine learning algorithms to MLlib.
> I must evaluate their performance on a cluster, changing the input data
> size, the number of CPU cores, and their parameters.
>
> I would like to build my development version of Spark on EC2 automatically.
> Is there already a build script for a development version, like the spark-ec2
> script?
> Or if you have any good ideas on how to evaluate the performance of a developing
> MLlib algorithm on a Spark cluster like EC2, could you tell me?
>
> Best,
>
> -----
> -- Yu Ishikawa
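P.S. For reference, a spot-instance launch with spark-ec2 looks roughly like the following; the key pair, identity file, spot price, instance type, and cluster name are all placeholders, so check ./spark-ec2 --help for the full option list:

    # Launch a 4-slave cluster on spot instances:
    ./ec2/spark-ec2 --key-pair=my-keypair --identity-file=my-keypair.pem \
      --slaves=4 --spot-price=0.05 --instance-type=m3.xlarge \
      launch mllib-perf-test

    # (For a development build, spark-ec2 can also take a git hash via
    # --spark-version together with --spark-git-repo, if memory serves.)

    # Log in, run your driver scripts and collect output, then tear down:
    ./ec2/spark-ec2 --key-pair=my-keypair --identity-file=my-keypair.pem \
      login mllib-perf-test
    ./ec2/spark-ec2 destroy mllib-perf-test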