I agree with Frances: optimizations are mostly runner-specific, and Beam will probably provide a library of common (well-optimized) graph transformations and composite transforms. We could, however, provide some sort of pipeline "reviewer" by implementing a PipelineVisitor that inspects the graph of transforms that is about to be executed. For example: if my graph uses GroupByKey, I'd be notified that Combine.perKey could be more efficient (which is right in most cases, though not all).
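Roughly, such a reviewer could look like the toy sketch below. This is a hypothetical illustration that walks a flat list of transform names; the real Beam PipelineVisitor traverses the pipeline's transform hierarchy node by node, but the advisory idea is the same.

```java
import java.util.ArrayList;
import java.util.List;

// Toy pipeline "reviewer" (hypothetical; the actual Beam PipelineVisitor
// walks the transform hierarchy rather than a flat list of names).
public class PipelineReviewer {

    // Walk the transforms in topological order and collect advisory
    // warnings for patterns that often have a cheaper equivalent.
    static List<String> review(List<String> transforms) {
        List<String> warnings = new ArrayList<>();
        for (String t : transforms) {
            if (t.equals("GroupByKey")) {
                warnings.add("GroupByKey detected: Combine.perKey may be "
                        + "more efficient if you only aggregate values per key.");
            }
        }
        return warnings;
    }

    public static void main(String[] args) {
        List<String> pipeline = List.of("Read", "ParDo", "GroupByKey", "ParDo");
        review(pipeline).forEach(System.out::println);
    }
}
```

A real implementation would also check whether the GroupByKey's output is in fact only aggregated downstream before suggesting the swap, since the two are not always interchangeable.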
Amit

On Fri, Feb 19, 2016 at 9:37 AM lonely Feb <[email protected]> wrote:

> Sounds right. Are there any papers about how Cloud Dataflow does
> optimization? Spark, as far as I know, will not change the execution order
> of your transforms, but relies on lazy evaluation and its DAG scheduler.
>
> 2016-02-19 13:20 GMT+08:00 Frances Perry <[email protected]>:
>
> > (I'm not familiar with the details of Catalyst itself.)
> >
> > The existing runners (Cloud Dataflow, Spark, Flink) all do optimizations
> > of their own, though it's quite likely there's a set of optimizations
> > that are conceptually shared. For example, something like ParDo fusion
> > is pretty basic to executing the Beam model. However, even that could be
> > tuned very differently depending on the backend you are targeting. So I
> > don't think we should have a shared optimizer for all of Beam. However,
> > if there's a set of graph transformations that are useful to multiple
> > runners, it'd be great to have them written in a general way and put in
> > some kind of runner util package.
> >
> > Frances
> >
> > On Thu, Feb 18, 2016 at 6:37 PM, lonely Feb <[email protected]> wrote:
> >
> > > Should we have a common optimization framework for Beam, just the same
> > > as Spark Catalyst? Optimization is so significant, but it seems that
> > > we have no plans for it?
