Re: Benchmarks of Flink, supporting Flink in BigDataBench

Stephan Ewen Mon, 20 Jul 2015 01:21:35 -0700

Hi!

Thanks for reaching out and adding Flink to BigDataBench.


The plan you sent looks like a nice first draft. It consists almost entirely
of batch jobs. Here are a few ideas for what you could add as batch jobs:

 - Joins are something people seem to do a lot with these systems, so a 2-3
table join would be a nice addition

 - For batch algorithms, it is often interesting to scale them beyond
memory (we have seen that a lot from users)

 - For graph algorithms, you can try incremental versions (see here:
http://data-artisans.com/data-analysis-with-flink.html)
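
To make the join idea concrete, here is a minimal stand-alone Java sketch of
what a three-table join workload computes. It is not Flink API code: the class
name, the table layouts (orders, customers, products) and the output format
are all made up for illustration. Something like this could serve as a
small-scale reference for checking the results of the real benchmark job.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reference implementation of a 3-table equi-join.
// Plain Java only; nothing here is part of the Flink API.
class ThreeTableJoinSketch {

    /** Builds a hash index from the integer key in column 0 to the rows carrying it. */
    static Map<Integer, List<String[]>> indexByFirstColumn(List<String[]> rows) {
        Map<Integer, List<String[]>> index = new HashMap<>();
        for (String[] row : rows) {
            index.computeIfAbsent(Integer.parseInt(row[0]), k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    /**
     * Assumed layouts (illustrative only):
     *   orders    = {orderId, customerId, productId}
     *   customers = {customerId, name}
     *   products  = {productId, title}
     * Returns one "orderId,name,title" string per matching combination.
     * Orders without a matching customer or product are dropped (inner join).
     */
    static List<String> join(List<String[]> orders,
                             List<String[]> customers,
                             List<String[]> products) {
        Map<Integer, List<String[]>> customersById = indexByFirstColumn(customers);
        Map<Integer, List<String[]>> productsById = indexByFirstColumn(products);

        List<String> result = new ArrayList<>();
        for (String[] order : orders) {
            List<String[]> matchingCustomers = customersById.getOrDefault(
                    Integer.parseInt(order[1]), Collections.<String[]>emptyList());
            List<String[]> matchingProducts = productsById.getOrDefault(
                    Integer.parseInt(order[2]), Collections.<String[]>emptyList());
            for (String[] customer : matchingCustomers) {
                for (String[] product : matchingProducts) {
                    result.add(order[0] + "," + customer[1] + "," + product[1]);
                }
            }
        }
        return result;
    }
}
```

The sketch builds both hash tables in memory, which is exactly what stops
working once the inputs grow beyond memory, so it is only meant for small
validation inputs.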



On the streaming side, it is harder, as the systems are very different
there and not every system can do everything.
For Flink, some ideas would be:
  - Streaming Grep
  - Streaming pattern detection (see
https://github.com/StephanEwen/flink-demos/tree/master/streaming-state-machine
)
  - Streaming word count
  - For streaming jobs, it is often interesting to compare runs with fault
tolerance enabled and disabled
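
As a rough illustration of how the first and third workloads differ, here is
a stand-alone Java sketch (not Flink API code; the class and method names are
hypothetical). Streaming grep is stateless and decides per record, while
streaming word count keeps running state and emits an updated count for every
incoming word. That state is what fault tolerance has to protect, which is why
the enabled/disabled comparison is most telling for stateful jobs.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the semantics of two streaming workloads.
// Plain Java only; nothing here is part of the Flink API.
class StreamingSketch {

    // Running state of the word count: word -> number of occurrences so far.
    private final Map<String, Integer> counts = new HashMap<>();

    /** Streaming grep: a stateless per-record filter. */
    static boolean grep(String line, String needle) {
        return line.contains(needle);
    }

    /**
     * Streaming word count: processes one incoming line and returns one
     * "word=count" update per word, reflecting the count after that word.
     */
    List<String> onLine(String line) {
        List<String> updates = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split() yields an empty token if the line starts with a separator
            }
            int newCount = counts.merge(word, 1, Integer::sum);
            updates.add(word + "=" + newCount);
        }
        return updates;
    }
}
```

In Flink itself, fault tolerance for streaming jobs is switched on by enabling
checkpointing on the StreamExecutionEnvironment (enableCheckpointing with an
interval in milliseconds) and is off if that call is omitted.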



A few generic comments on Flink, for performance testing.

 - The Java API is usually slightly faster than the Scala API, but only by
a bit
 - Tuples (Java) and case classes (Scala) usually beat POJOs in performance.
 - If your implementation allows it, turning on object reuse mode
("enableObjectReuse()" on the ExecutionConfig) can gain some performance.
 - If you implement sorting / TeraSort, have a look at the following post
to see how to make sure that Flink handles the Hadoop types efficiently:
http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html
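
On the object reuse point, "if your implementation allows it" roughly means
that user functions must not hold on to input objects across calls. The
following stand-alone Java sketch (not Flink code; the Record type and the
method names are hypothetical) mimics a runtime that recycles one mutable
record instance, and shows one function that is safe under reuse and one that
silently computes a wrong result.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the object reuse hazard.
// Plain Java only; nothing here is part of the Flink API.
class ObjectReuseSketch {

    /** A mutable record that the simulated runtime recycles for every element. */
    static final class Record {
        long value;
    }

    /** Safe under reuse: each record is consumed before the next one arrives. */
    static long sum(long[] input) {
        Record reused = new Record();
        long total = 0;
        for (long v : input) {
            reused.value = v;      // the "runtime" overwrites the same instance
            total += reused.value; // the "user function" reads it immediately
        }
        return total;
    }

    /** Unsafe under reuse: buffers references that all point to one instance. */
    static long sumBuffered(long[] input) {
        Record reused = new Record();
        List<Record> buffer = new ArrayList<>();
        for (long v : input) {
            reused.value = v;
            buffer.add(reused);    // stores a reference, not a copy
        }
        long total = 0;
        for (Record r : buffer) {
            total += r.value;      // every entry now holds the last value written
        }
        return total;
    }
}
```

For the input {1, 2, 3}, sum returns 6, while sumBuffered returns 9 because
all three buffered references see the last value. A benchmark implementation
that buffers or caches inputs would need to copy them before object reuse can
be turned on safely.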

Greetings,
Stephan



On Mon, Jul 20, 2015 at 9:47 AM, Xinhui Tian <tianxin...@ict.ac.cn> wrote:

> Hello, everyone.
>
> I'm a PhD student from the Institute of Computing Technology, Chinese
> Academy of Sciences. Our team has released a benchmark for big data systems
> called BigDataBench, which has become an industry-standard big data
> benchmark in China. You can find our work on this website:
> http://prof.ict.ac.cn/BigDataBench/
>
> We are now planning to support Flink in our benchmark, which could provide
> a set of workloads on different domains and an objective comparison with
> systems such as Spark and Hadoop. But we are new to this system, so we are
> asking for your advice about benchmark design. The first thing is to decide
> what workloads should be added to our benchmark and which domain we should
> pay more attention to.
>
> The attachment is a preliminary plan, which lists some workloads that have
> already been implemented in the Spark version. We plan to first implement
> these workloads on Flink, and evaluate these two systems. Does anyone have
> some advice for this list? We will be very grateful for any ideas.
> BigDataBench_for_Flink.docx
> <
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/file/n7079/BigDataBench_for_Flink.docx
> >
>
> Thanks ;)
>
>
>
> --
> View this message in context:
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Benchmarks-of-Flink-supporting-Flink-in-BigDataBench-tp7079.html
> Sent from the Apache Flink Mailing List archive at Nabble.com.
>
