Thanks for pointing to that link! It looks useful, but it's more involved than the case I'm trying to address.
In my case, we set y = f(x), then we use y later on in future projections (z = g(y)). In that case, the analysis is trivial in that we aren't trying to find equivalent expressions; we actually know that z is based off of y. In addition, we are already storing y because it's one of the projections, so there's no tradeoff between time and memory.

Perf gains

I believe that the performance gains can be quite substantial, but can you check that the case I bring up below will indeed benefit from such an optimization? For example, suppose I have a date column (unclean_date) that is stored in some strange string format. I then use a UDF or a Hive function that converts it to the Catalyst date type (cleaned_date). Next, I want to extract one column with the month and another with the year, so I can do groupBys/aggregations later. Currently, every projection/expression based off of cleaned_date will have to do the expensive parsing again if I avoid caching and prefer to do everything in one pass. (I've included a quick sketch of this below.)

Code generation phase vs optimization

Is there a reason why doing it at the optimization phase is the wrong approach? It sounds like we're actually logically changing the order of computation if we do my proposed approach. I do agree, however, that if there are lower-hanging fruits, then we should tackle those first =)
On Sat, May 30, 2015 at 10:00 PM Michael Armbrust <mich...@databricks.com> wrote:

> I think this is likely something that we'll want to do during the code
> generation phase. Though it's probably not the lowest-hanging fruit at this
> point.
>
> On Sun, May 31, 2015 at 5:02 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> I think you are looking for
>> http://en.wikipedia.org/wiki/Common_subexpression_elimination in the
>> optimizer.
>>
>> One thing to note is that as we do more and more optimizations like this,
>> the optimization time might increase. Do you see a case where this can
>> bring you substantial performance gains?
>>
>> On Sat, May 30, 2015 at 9:02 AM, Justin Uang <justin.u...@gmail.com>
>> wrote:
>>
>>> On second thought, perhaps this can be done by writing a rule that
>>> builds the DAG of dependencies between expressions, then converts it into
>>> several layers of projections, where each new layer is allowed to depend on
>>> expression results from previous projections?
>>>
>>> Are there any pitfalls to this approach?
>>>
>>> On Sat, May 30, 2015 at 11:30 AM Justin Uang <justin.u...@gmail.com>
>>> wrote:
>>>
>>>> If I do the following
>>>>
>>>> df2 = df.withColumn('y', df['x'] * 7)
>>>> df3 = df2.withColumn('z', df2.y * 3)
>>>> df3.explain()
>>>>
>>>> then the result is
>>>>
>>>> Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65]
>>>>  PhysicalRDD [date#56,id#57,timestamp#58,x#59], MapPartitionsRDD[125] at mapPartitions at SQLContext.scala:1163
>>>>
>>>> Effectively I want to compute
>>>>
>>>> y = f(x)
>>>> z = g(y)
>>>>
>>>> The Catalyst optimizer realizes that y#64 is the same as the one
>>>> previously computed; however, when building the projection, it ignores
>>>> the fact that it has already computed y, so it calculates `x * 7` twice:
>>>>
>>>> y = x * 7
>>>> z = x * 7 * 3
>>>>
>>>> If I wanted to make this fix, would it be possible to do the logic in
>>>> the optimizer phase? I imagine that it's difficult because the expressions
>>>> in InterpretedMutableProjection don't have access to the previous
>>>> expression results, only the input row, and the design doesn't seem to
>>>> cater to this.