Nice idea, Amit. We'll definitely need a validation suite that applies at the model level to catch things like aggregating an unbounded collection without windowing/triggers, or not using a deterministic coder for key grouping. Adding optional hints/suggestions there too could be interesting. Alternatively, if we tie it into the new DirectRunner we're working on, we might be able to give better suggestions by inspecting runtime characteristics.
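To make the model-level check concrete, here is a minimal sketch of the unbounded-aggregation rule. All names here (`ModelValidator`, `Node`, `Boundedness`) are hypothetical stand-ins for illustration, not real Beam API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a pipeline node -- NOT the real Beam API.
public class ModelValidator {
    public enum Boundedness { BOUNDED, UNBOUNDED }

    public static class Node {
        final String name;
        final Boundedness boundedness;
        final boolean windowed;   // has a non-global windowing strategy
        final boolean aggregates; // e.g. a GroupByKey or Combine step

        public Node(String name, Boundedness b, boolean windowed, boolean aggregates) {
            this.name = name;
            this.boundedness = b;
            this.windowed = windowed;
            this.aggregates = aggregates;
        }
    }

    // Flag aggregations over unbounded input that lack windowing:
    // without windows/triggers such a grouping can never emit a result.
    public static List<String> validate(List<Node> nodes) {
        List<String> issues = new ArrayList<>();
        for (Node n : nodes) {
            if (n.aggregates && n.boundedness == Boundedness.UNBOUNDED && !n.windowed) {
                issues.add(n.name + ": aggregating an unbounded collection without windowing");
            }
        }
        return issues;
    }
}
```

The same walk could also emit the optional hints mentioned above by returning advisory messages alongside hard errors.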
There's no academic paper on the optimizations in Cloud Dataflow, but there are a few details in the documentation: https://cloud.google.com/dataflow/service/dataflow-service-desc

Frances

On Fri, Feb 19, 2016 at 8:36 AM, Amit Sela <[email protected]> wrote:
> I agree with Frances, optimizations are mostly runner-specific, and Beam
> will probably provide a library of common (well-optimized) graph
> transformations and composite transforms.
> We could, however, provide some sort of pipeline "reviewer" by implementing
> a PipelineVisitor that reviews the graph of transformations that is about
> to be executed. For example: if my graph uses GroupByKey, I'll be notified
> that Combine.perKey could be more efficient (which is right in most cases,
> though not all).
>
> Amit
>
> On Fri, Feb 19, 2016 at 9:37 AM lonely Feb <[email protected]> wrote:
>
> > Sounds right. Are there any papers about how Cloud Dataflow works on
> > optimization? Spark, as far as I know, will not change the execution
> > order of your transforms but relies on lazy operations and the DAG
> > scheduler pipeline.
> >
> > 2016-02-19 13:20 GMT+08:00 Frances Perry <[email protected]>:
> >
> > > (I'm not familiar with the details of Catalyst itself.)
> > >
> > > The existing runners (Cloud Dataflow, Spark, Flink) all do
> > > optimizations of their own, though it's quite likely there's a set of
> > > optimizations that are conceptually shared. For example, something
> > > like ParDo fusion is pretty basic to executing the Beam model.
> > > However, even that could be tuned very differently depending on the
> > > backend you are targeting. So I don't think we should have a shared
> > > optimizer for all of Beam. However, if there's a set of graph
> > > transformations that are useful to multiple runners, it'd be great to
> > > have them written in a general way and put in some kind of runner
> > > util package.
> > >
> > > Frances
> > >
> > > On Thu, Feb 18, 2016 at 6:37 PM, lonely Feb <[email protected]> wrote:
> > >
> > > > Should we have a common optimization framework for Beam, just the
> > > > same as Spark Catalyst? Optimization is so significant, but it
> > > > seems that we have no plans for it?
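The "pipeline reviewer" idea from Amit's message could be sketched as a walk over the transform graph that emits advisory hints, e.g. suggesting Combine.perKey when a GroupByKey's output is immediately reduced per key (so the runner could combine before the shuffle). Everything here (`PipelineReviewer`, `addEdge`, the name-prefix matching) is a hypothetical toy model, not Beam's actual PipelineVisitor API:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of a pipeline "reviewer" -- NOT the real Beam PipelineVisitor.
public class PipelineReviewer {
    // transform name -> name of the transform that consumes its output
    private final Map<String, String> edges = new LinkedHashMap<>();

    public PipelineReviewer addEdge(String from, String to) {
        edges.put(from, to);
        return this;
    }

    // If a GroupByKey feeds directly into a per-key reduction, hint that
    // Combine.perKey is usually more efficient (right in most cases, not all).
    public List<String> review() {
        List<String> hints = new ArrayList<>();
        for (Map.Entry<String, String> e : edges.entrySet()) {
            if (e.getKey().startsWith("GroupByKey") && e.getValue().startsWith("Reduce")) {
                hints.add("Consider Combine.perKey instead of " + e.getKey()
                        + " followed by " + e.getValue());
            }
        }
        return hints;
    }
}
```

A real implementation would of course inspect the transform types rather than names, and would only ever advise, since the suggestion is not correct for every pipeline.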
