Hi Talat,

This would be great! It could be really interesting and useful for the
Gora community.
I think you could use Spark's native data access to the different
available stores as a baseline, and then compare it against the GoraRDD
integration. We have GoraCI, which was designed as a continuous
ingestion test to verify that Gora doesn't lose data when running a
distributed job, but it doesn't take into account the overhead of actually
using Gora as middleware.
Choosing a CPU-bound algorithm could be an interesting second step, because
if you start with, for example, an iterative algorithm, then the many
layers of caching might make the benefits/drawbacks of using Gora difficult
to observe (Spark's internal caching mechanism, the operating system's
caching, and Gora holding data in memory until it is flushed). What I am
trying to say is that the results will depend on the algorithm chosen and
on the type of caching it takes advantage of (temporal or spatial locality).
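
To make the comparison a bit more concrete, here is a minimal sketch of how
the overhead could be measured while keeping the first (cold) run separate
from the later (warm) runs, so that caching effects don't hide the cost of
the middleware. It is written in Python only for brevity and uses no Spark
or Gora API at all: time_runs, summarize, relative_overhead and the two
placeholder jobs are hypothetical names that I made up for illustration.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(job: Callable[[], object], runs: int = 5) -> List[float]:
    """Run `job` `runs` times and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        job()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings: List[float]) -> Dict[str, float]:
    """Separate the first (cold) run from the median of the later (warm) runs."""
    cold = timings[0]
    warm = statistics.median(timings[1:]) if len(timings) > 1 else cold
    return {"cold": cold, "warm": warm}


def relative_overhead(baseline: float, candidate: float) -> float:
    """Fractional slowdown of candidate vs. baseline (0.25 means 25% slower)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical stand-ins: in a real benchmark these would launch the same
    # job once through Spark's native store access and once through GoraRDD.
    def native_job():
        return sum(range(200_000))

    def gora_job():
        return sum(x for x in range(200_000))

    native = summarize(time_runs(native_job))
    gora = summarize(time_runs(gora_job))
    for phase in ("cold", "warm"):
        print(phase, relative_overhead(native[phase], gora[phase]))
```

Reporting the cold and warm numbers separately should show whether an
apparent difference comes from Gora itself or from one of the caching
layers mentioned above.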


Renato M.

2015-10-26 15:36 GMT+01:00 Furkan KAMACI <[email protected]>:

> Hi All,
>
> I want to prepare a benchmark and presentation for my Spark backend of Gora
> with the help of Talat. I am planning to follow the approach of benchmarking
> for Spark by University of California, Berkeley [1][2].
>
> Dimensions of my benchmark:
>
> * Hadoop Map/Reduce
> * Spark
> * Hadoop Map/Reduce via Gora
> * Spark via Gora
>
> For that aim, I would like to work on two types of dataset:
>
> 1) Data-intensive
> 2) CPU-intensive
>
> First of all, is there any benchmark which presents the performance effect
> of using Gora for Hadoop/MapReduce?
>
> Secondly, do you suggest any dataset (or tool) for my purposes (i.e.
> Logistic Regression, PageRank, TeraSort [3], Intel-Hadoop Benchmark[4],
> etc)?
>
>
> Kind Regards,
> Furkan KAMACI
>
> [1] https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
> [2] http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf
> [3] http://www.michael-noll.com/blog/2011/04/09/benchmarking-and-stress-testing-an-hadoop-cluster-with-terasort-testdfsio-nnbench-mrbench/
> [4] https://github.com/intel-hadoop/HiBench
>
