Sorry Furkan! I did mean you! Please excuse me.
Another interesting resource to look into is YCSB++: http://www.pdl.cmu.edu/ycsb++/

Best,
Renato M.

2015-10-30 0:31 GMT+01:00 Furkan KAMACI <[email protected]>:

> Hi Renato,
>
> I think you wanted to mention me :) My main purpose is to compare Spark
> and GoraSparkEngine. The Spark papers use K-Means, Logistic Regression,
> Expectation Maximization and Alternating Least Squares for performance
> benchmarking against Hadoop Map/Reduce (as well as a task which loads a
> 39 GB dump of Wikipedia into memory and runs queries on it), and that's
> why I want to run it on two different datasets.
>
> Kind Regards,
> Furkan KAMACI
>
> On Fri, Oct 30, 2015 at 1:20 AM, Renato Marroquín Mogrovejo <
> [email protected]> wrote:
>
> > Hi Talat,
> >
> > This would be great! This is something that might be really
> > interesting and useful for Gora's community!
> > I think you could use as a baseline the native data access of Spark
> > to the different available stores, and then compare it against the
> > GoraRDD integration. We have GoraCI, which was designed as a
> > continuous ingestion test to verify that Gora doesn't lose data when
> > doing a distributed job, but it doesn't take into account the
> > overhead of actually using Gora as a middleware.
> > Choosing a CPU-bound algorithm could be a second interesting step,
> > because if you use, for example, an iterative algorithm from the
> > start, then the many layers of caching might make the
> > benefits/drawbacks of using Gora difficult to observe (Spark's
> > internal caching mechanism, the operating system caching, and Gora
> > holding the data in memory until it is flushed). What I am trying to
> > say is that the results will depend on the algorithm chosen and the
> > type of caching it takes advantage of (temporal or spatial locality).
> >
> > Renato M.
> >
> > 2015-10-26 15:36 GMT+01:00 Furkan KAMACI <[email protected]>:
> >
> > > Hi All,
> > >
> > > I want to prepare a benchmark and presentation for my Spark Backend
> > > of Gora with the help of Talat. I am planning to follow the
> > > benchmarking approach used for Spark by the University of
> > > California, Berkeley [1][2].
> > >
> > > Dimensions of my benchmark:
> > >
> > > * Hadoop Map/Reduce
> > > * Spark
> > > * Hadoop Map/Reduce via Gora
> > > * Spark via Gora
> > >
> > > For that aim, I would like to work on two types of dataset:
> > >
> > > 1) Data-intensive
> > > 2) CPU-intensive
> > >
> > > First of all, is there any benchmark which presents the performance
> > > effect of using Gora for Hadoop/MapReduce?
> > >
> > > Secondly, do you suggest any dataset (or tool) for my purposes
> > > (i.e. Logistic Regression, PageRank, TeraSort [3], Intel-Hadoop
> > > Benchmark [4], etc.)?
> > >
> > > Kind Regards,
> > > Furkan KAMACI
> > >
> > > [1] https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
> > > [2] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
> > > [3]
> > > http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
> > > [4] https://github.com/intel-hadoop/HiBench

