FYI, there was an outstanding PR about adding the Nexmark suite: https://github.com/apache/incubator-beam/pull/366
On Tue, Oct 18, 2016 at 1:12 PM, Ismaël Mejía <[email protected]> wrote:

> @Jason, Just some additional refs for ideas, since I already researched a
> little bit about how people evaluated this in other Apache projects.
>
> Yahoo published a benchmarking analysis of different streaming frameworks
> about a year ago:
> https://github.com/yahoo/streaming-benchmarks
>
> And the Flink guys extended it:
> https://github.com/dataArtisans/yahoo-streaming-benchmark
>
> Notice that the common approach comes from the classical database world,
> and it is to take one of the TPC query suites (TPC-H or TPC-DS) and
> evaluate a data processing framework against it; Spark does this to
> evaluate their SQL performance.
>
> https://github.com/databricks/spark-sql-perf
>
> However this approach is not 100% aligned with Beam because AFAIK there is
> not a TPC suite for continuous processing, which is why I found the
> NexMark suite a more appropriate example.
>
>
> On Tue, Oct 18, 2016 at 9:50 PM, Ismaël Mejía <[email protected]> wrote:
>
> > Hello,
> >
> > Now that we are discussing the subject of performance testing, I want
> > to jump into the conversation to remind everybody that we have a really
> > interesting benchmarking suite already contributed by Google that has
> > (sadly) not been merged yet.
> >
> > https://github.com/apache/incubator-beam/pull/366
> > https://issues.apache.org/jira/browse/BEAM-160
> >
> > This is not exactly the kind of benchmark of the current discussion,
> > but for me it is a super valuable contribution that I hope we can
> > use/refine to evaluate the runners.
> >
> > Ismaël Mejía
> >
> >
> > On Tue, Oct 18, 2016 at 8:16 PM, Jean-Baptiste Onofré <[email protected]>
> > wrote:
> >
> >> It sounds like a good idea to me.
> >>
> >> Regards
> >> JB
> >>
> >>
> >> On 10/18/2016 08:08 PM, Amit Sela wrote:
> >>
> >>> @Jesse how about runners "tracing" the constructed DAG (by Beam) so
> >>> that it's clear what the runner actually executed?
> >>>
> >>> Example:
> >>> For the SparkRunner, a ParDo translates to a mapPartitions
> >>> transformation.
> >>>
> >>> That could provide transparency when debugging/benchmarking pipelines
> >>> per-runner.
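To make the ParDo-to-mapPartitions example concrete, here is a rough sketch of the kind of pipeline such a trace would describe. The pipeline contents and class names below are made up for illustration, and the Spark-side mapping is only indicated in a comment; this is not the actual SparkRunner translation code.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ParDoTranslationSketch {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // What the user writes and what shows up in the Beam DAG: a single ParDo.
    PCollection<Integer> lengths =
        p.apply(Create.of("a", "bb", "ccc"))
         .apply("MeasureLength", ParDo.of(new DoFn<String, Integer>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             c.output(c.element().length());
           }
         }));

    // A Spark-based runner would execute this ParDo as (roughly) something like
    //   rdd.mapPartitions(iter -> { /* run the DoFn over the partition */ })
    // A per-runner "trace" of the translated DAG would surface exactly that
    // mapping, which is the transparency Amit is describing.

    p.run();
  }
}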
> >>>
> >>> On Tue, Oct 18, 2016 at 8:25 PM Jesse Anderson <[email protected]>
> >>> wrote:
> >>>
> >>>> @Dan before starting with Beam, I'd want to know how much performance
> >>>> I'm giving up by not programming directly to the API.
> >>>>
> >>>> On Tue, Oct 18, 2016 at 10:03 AM Dan Halperin <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> I think there are lots of excellent one-off performance studies, but
> >>>>> I'm not sure how useful that is to Beam.
> >>>>>
> >>>>> From a test infra point of view, I'm wondering more about tracking
> >>>>> of performance over time, identifying regressions, etc.
> >>>>>
> >>>>> Google has some tools like PerfKit
> >>>>> <https://github.com/GoogleCloudPlatform/PerfKitBenchmarker>, which
> >>>>> is basically a skin on a database + some scripts to load and query
> >>>>> data; but I don't love it. Do other Apache projects do public,
> >>>>> long-term benchmarking and performance regression testing?
> >>>>>
> >>>>> Dan
> >>>>>
> >>>>> On Tue, Oct 18, 2016 at 8:52 AM, Jesse Anderson <[email protected]>
> >>>>> wrote:
> >>>>>
> >>>>>> I found data Artisans' benchmarking post
> >>>>>> <http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/>.
> >>>>>> They also shared the code
> >>>>>> <https://github.com/dataArtisans/performance>. I didn't dig in
> >>>>>> much, but they did a wide range of algorithms. They have the native
> >>>>>> code, so you write the Beam code and check against the native
> >>>>>> performance.
> >>>>>>
> >>>>>> On Mon, Oct 17, 2016 at 5:14 PM amir bahmanyari
> >>>>>> <[email protected]> wrote:
> >>>>>>
> >>>>>>> Hi Jason, I have been busy benchmarking a Flink cluster (Spark
> >>>>>>> next) under Beam. I can share my experience. Can you list items of
> >>>>>>> interest so I can answer them to the best of my knowledge. Cheers
> >>>>>>>
> >>>>>>> From: Jason Kuster <[email protected]>
> >>>>>>> To: [email protected]
> >>>>>>> Sent: Monday, October 17, 2016 5:06 PM
> >>>>>>> Subject: Exploring Performance Testing
> >>>>>>>
> >>>>>>> Hey all,
> >>>>>>>
> >>>>>>> Now that we've covered some of the initial ground with regard to
> >>>>>>> correctness testing, I'm going to be starting work on performance
> >>>>>>> testing and benchmarking. I wanted to reach out and see what
> >>>>>>> people's experiences have been with performance testing and
> >>>>>>> benchmarking frameworks, particularly in other Apache projects.
> >>>>>>> Anyone have any experience or thoughts?
> >>>>>>>
> >>>>>>> Best,
> >>>>>>>
> >>>>>>> Jason
> >>>>>>>
> >>>>>>> --
> >>>>>>> -------
> >>>>>>> Jason Kuster
> >>>>>>> Apache Beam (Incubating) / Google Cloud Dataflow
> >>>>>>>
> >>
> >> --
> >> Jean-Baptiste Onofré
> >> [email protected]
> >> http://blog.nanthrax.net
> >> Talend - http://www.talend.com
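Coming back to Dan's point about tracking performance over time: the "database plus some scripts" part can be quite small. Below is a rough, self-contained sketch of a check that flags the latest run of a benchmark as a regression when it is noticeably slower than the average of earlier runs. The class name, numbers, and 10% tolerance are invented for illustration; in practice the timings would be queried from a results database rather than hard-coded.

import java.util.Arrays;
import java.util.List;

public class RegressionCheckSketch {

  // Returns true if the newest runtime (last element) exceeds the mean of the
  // earlier runtimes by more than the given tolerance (e.g. 0.10 for 10%).
  static boolean isRegression(List<Double> runtimesMillis, double tolerance) {
    int n = runtimesMillis.size();
    if (n < 2) {
      return false; // not enough history to compare against
    }
    double latest = runtimesMillis.get(n - 1);
    double baseline = runtimesMillis.subList(0, n - 1).stream()
        .mapToDouble(Double::doubleValue)
        .average()
        .orElse(latest);
    return latest > baseline * (1.0 + tolerance);
  }

  public static void main(String[] args) {
    // Made-up numbers standing in for rows loaded from a results database.
    List<Double> wordCountRuntimes = Arrays.asList(120.0, 118.5, 121.2, 119.8, 140.3);
    System.out.println("Regression? " + isRegression(wordCountRuntimes, 0.10));
  }
}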
