I think Michael's bringing up code gen because the compiler (not Spark, but javac and JVM JIT) already does common subexpression elimination, so we might get it for free during code gen.
On Sun, May 31, 2015 at 11:48 AM, Justin Uang <justin.u...@gmail.com> wrote: > Thanks for pointing to that link! It looks like it’s useful, but it does > look more complicated than the case I’m trying to address. > > In my case, we set y = f(x), then we use y later on in future projections > (z = g(y)). In that case, the analysis is trivial in that we aren’t > trying to find equivalent expressions, we actually know that z is based > off of y. In addition, we are already storing off y because it’s one of > the projections, so there’s no tradeoff between time vs memory. > Perf gains > > I believe that the performance gains can be quite substantial, but can you > check that the case I bring up below will indeed benefit from such a > optimization? > > For example, suppose I have a date column (unclean_date) that is stored > in some strange string format. I then use an udf or a hive function that > converts it to the Catalyst date type (cleaned_date). Next, I want to > extract one column with the month, and another with the year, so I can do > groupBys/aggregations later. Currently, every projection/expression based > off of the cleaned_date will have to do the expensive parsing again if I > avoid caching and prefer to do everything in one pass. > Code generation phase vs optimization > > Is there a reason why doing it at the optimization phase is the wrong > approach? If sounds like we’re actually logically changing the order of > computation if we do my proposed approach. I do agree however if there are > lower hanging fruits, then we should tackle those first =) > > > On Sat, May 30, 2015 at 10:00 PM Michael Armbrust <mich...@databricks.com> > wrote: > >> I think this is likely something that we'll want to do during the code >> generation phase. Though its probably not the lowest hanging fruit at this >> point. >> >> On Sun, May 31, 2015 at 5:02 AM, Reynold Xin <r...@databricks.com> wrote: >> >>> I think you are looking for >>> http://en.wikipedia.org/wiki/Common_subexpression_elimination in the >>> optimizer. >>> >>> One thing to note is that as we do more and more optimization like this, >>> the optimization time might increase. Do you see a case where this can >>> bring you substantial performance gains? >>> >>> >>> On Sat, May 30, 2015 at 9:02 AM, Justin Uang <justin.u...@gmail.com> >>> wrote: >>> >>>> On second thought, perhaps can this be done by writing a rule that >>>> builds the dag of dependencies between expressions, then convert it into >>>> several layers of projections, where each new layer is allowed to depend on >>>> expression results from previous projections? >>>> >>>> Are there any pitfalls to this approach? >>>> >>>> On Sat, May 30, 2015 at 11:30 AM Justin Uang <justin.u...@gmail.com> >>>> wrote: >>>> >>>>> If I do the following >>>>> >>>>> df2 = df.withColumn('y', df['x'] * 7) >>>>> df3 = df2.withColumn('z', df2.y * 3) >>>>> df3.explain() >>>>> >>>>> Then the result is >>>>> >>>>> > Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS >>>>> y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65] >>>>> > PhysicalRDD [date#56,id#57,timestamp#58,x#59], >>>>> MapPartitionsRDD[125] at mapPartitions at SQLContext.scala:1163 >>>>> >>>>> Effectively I want to compute >>>>> >>>>> y = f(x) >>>>> z = g(y) >>>>> >>>>> The catalyst optimizer realizes that y#64 is the same as the one >>>>> previously computed, however, when building the projection, it is ignoring >>>>> the fact that it had already computed y, so it calculates `x * 7` twice. >>>>> >>>>> y = x * 7 >>>>> z = x * 7 * 3 >>>>> >>>>> If I wanted to make this fix, would it be possible to do the logic in >>>>> the optimizer phase? I imagine that it's difficult because the expressions >>>>> in InterpretedMutableProjection don't have access to the previous >>>>> expression results, only the input row, and that the design doesn't seem >>>>> to >>>>> be catered for this. >>>>> >>>>> >>> >>