@Jason, just some additional references for ideas, since I have already looked a little into how people evaluate this in other Apache projects.

Yahoo published a benchmarking analysis of different streaming frameworks about a year ago:
https://github.com/yahoo/streaming-benchmarks

And the Flink guys extended it:
https://github.com/dataArtisans/yahoo-streaming-benchmark

Note that the common approach comes from the classical database world: take one of the TPC query suites (TPC-H or TPC-DS) and evaluate a data processing framework against it. Spark does this to evaluate its SQL performance:
https://github.com/databricks/spark-sql-perf

However, this approach is not 100% aligned with Beam because, AFAIK, there is no TPC suite for continuous processing. That is why I find the NexMark suite a more appropriate example.

On Tue, Oct 18, 2016 at 9:50 PM, Ismaël Mejía <[email protected]> wrote:

> Hello,
>
> Now that we are discussing about the subject of performance testing, I
> want to jump into the conversation to remind everybody that we have a
> really interesting benchmarking suite already contributed by google that
> has (sadly) not been merged yet.
>
> https://github.com/apache/incubator-beam/pull/366
> https://issues.apache.org/jira/browse/BEAM-160
>
> This is not exactly the kind of benchmark of the current discussion, but
> for me is a super valuable contribution that I hope we can use/refine to
> evaluate the runners.
>
> Ismaël Mejía
>
>
> On Tue, Oct 18, 2016 at 8:16 PM, Jean-Baptiste Onofré <[email protected]>
> wrote:
>
>> It sounds like a good idea to me.
>>
>> Regards
>> JB
>>
>>
>> On 10/18/2016 08:08 PM, Amit Sela wrote:
>>
>>> @Jesse how about runners "tracing" the constructed DAG (by Beam) so that
>>> it's clear what the runner actually executed ?
>>>
>>> Example:
>>> For the SparkRunner, a ParDo translates to a mapPartitions
>>> transformation.
>>>
>>> That could provide transparency when debugging/benchmarking pipelines
>>> per-runner.
>>>
>>> On Tue, Oct 18, 2016 at 8:25 PM Jesse Anderson <[email protected]>
>>> wrote:
>>>
>>>> @Dan before starting with Beam, I'd want to know how much performance
>>>> I've giving up by not programming directly to the API.
>>>>
>>>> On Tue, Oct 18, 2016 at 10:03 AM Dan Halperin <[email protected]>
>>>> wrote:
>>>>
>>>>> I think there are lots of excellent one-off performance studies, but
>>>>> I'm not sure how useful that is to Beam.
>>>>>
>>>>> From a test infra point of view, I'm wondering more about tracking of
>>>>> performance over time, identifying regressions, etc.
>>>>>
>>>>> Google has some tools like PerfKit
>>>>> <https://github.com/GoogleCloudPlatform/PerfKitBenchmarker> which is
>>>>> basically a skin on a database + some scripts to load and query data;
>>>>> but I don't love it. Do other Apache projects do public, long-term
>>>>> benchmarking and performance regression testing?
>>>>>
>>>>> Dan
>>>>>
>>>>> On Tue, Oct 18, 2016 at 8:52 AM, Jesse Anderson <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I found data Artisan's benchmarking post
>>>>>> <http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/>.
>>>>>> They also shared the code
>>>>>> <https://github.com/dataArtisans/performance>. I didn't dig in much,
>>>>>> but they did a wide range of algorithms. They have the native code,
>>>>>> so you write the Beam code and check against the native performance.
>>>>>>
>>>>>> On Mon, Oct 17, 2016 at 5:14 PM amir bahmanyari <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Jason, I have been busy bench-marking Flink Cluster (Spark next)
>>>>>>> under Beam. I can share my experience. Can you list items of
>>>>>>> interest to know so I can answer them to the best of my knowledge.
>>>>>>> Cheers
>>>>>>>
>>>>>>> From: Jason Kuster <[email protected]>
>>>>>>> To: [email protected]
>>>>>>> Sent: Monday, October 17, 2016 5:06 PM
>>>>>>> Subject: Exploring Performance Testing
>>>>>>>
>>>>>>> Hey all,
>>>>>>>
>>>>>>> Now that we've covered some of the initial ground with regard to
>>>>>>> correctness testing, I'm going to be starting work on performance
>>>>>>> testing and benchmarking. I wanted to reach out and see what
>>>>>>> people's experiences have been with performance testing and
>>>>>>> benchmarking frameworks, particularly in other Apache projects.
>>>>>>> Anyone have any experience or thoughts?
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jason
>>>>>>>
>>>>>>> --
>>>>>>> -------
>>>>>>> Jason Kuster
>>>>>>> Apache Beam (Incubating) / Google Cloud Dataflow
>>>>>>>
>>
>> --
>> Jean-Baptiste Onofré
>> [email protected]
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>
