Hi! Thanks for reaching out and adding Flink to BigDataBench.
The plan you sent looks like a nice first draft. It consists almost entirely of batch jobs. Here are a few ideas for what you could add as batch jobs:

- Joins are something people seem to do a lot with these systems, so a 2-3 table join would be a nice addition
- For batch algorithms, it is often interesting to scale them beyond memory (we have seen that a lot from users)
- For graph algorithms, you can try incremental versions (see here: http://data-artisans.com/data-analysis-with-flink.html)

On the streaming side, it is harder, as the systems are very different there and not every system can do everything. For Flink, some ideas would be:

- Streaming Grep
- Streaming pattern detection (see https://github.com/StephanEwen/flink-demos/tree/master/streaming-state-machine )
- Streaming word count
- For streaming jobs, it is often interesting to play with enabled / disabled fault tolerance

A few generic comments on Flink, for performance testing:

- The Java API is usually slightly faster than the Scala API, but only by a bit
- Tuples (Java) and case classes (Scala) usually beat POJOs in performance
- If your implementation allows it, turning on object reuse ("enableObjectReuse()" on the ExecutionConfig) can gain some performance
- If you implement sorting / Tera sort, have a look here for how to make sure that Flink handles the Hadoop types efficiently: http://eastcirclek.blogspot.kr/2015/06/terasort-for-spark-and-flink-with-range.html

Greetings,
Stephan

On Mon, Jul 20, 2015 at 9:47 AM, Xinhui Tian <tianxin...@ict.ac.cn> wrote:

> Hello, everyone.
>
> I'm a PhD student from the Institute of Computing Technology, Chinese
> Academy of Sciences. Our team has released a benchmark for big data systems
> called BigDataBench, which has become an industry-standard big data
> benchmark in China.
> You can find our work on this website:
> http://prof.ict.ac.cn/BigDataBench/
>
> We are now planning to support Flink in our benchmark, which could provide
> a set of workloads on different domains and an objective comparison with
> systems such as Spark and Hadoop. But we are new to this system, so we are
> asking for your advice about benchmark design. The first thing is to decide
> what workloads should be added to our benchmark and which domains we should
> pay more attention to.
>
> The attachment is a preliminary plan, which lists some workloads that have
> already been implemented in the Spark version. We plan to first implement
> these workloads on Flink, and evaluate these two systems. Does anyone have
> some advice for this list? We will be very grateful for any ideas.
> BigDataBench_for_Flink.docx
> <http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/file/n7079/BigDataBench_for_Flink.docx>
>
> Thanks ;)
>
> --
> View this message in context:
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Benchmarks-of-Flink-supporting-Flink-in-BigDataBench-tp7079.html
> Sent from the Apache Flink Mailing List archive at Nabble.com.