Thanks for pointing me to that link! It looks useful, but also more
complicated than the case I'm trying to address.

In my case, we set y = f(x), then use y later in future projections (z =
g(y)). Here the analysis is trivial: we aren't trying to discover equivalent
expressions, we know outright that z is computed from y. In addition, we are
already materializing y because it's one of the requested projections, so
there is no time vs. memory tradeoff.
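
To make this concrete, here is a minimal PySpark sketch (it mirrors the
example further down in the thread; df is assumed to already have a numeric
column x):

    df2 = df.withColumn('y', df['x'] * 7)    # y = f(x)
    df3 = df2.withColumn('z', df2['y'] * 3)  # z = g(y), written in terms of y

Since y is itself one of the requested output columns, reusing its computed
value for z would cost no additional memory.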
Perf gains

I believe the performance gains can be quite substantial, but can you check
that the case I bring up below will indeed benefit from such an
optimization?

For example, suppose I have a date column (unclean_date) that is stored in
some strange string format. I then use a UDF or a Hive function to convert it
to the Catalyst date type (cleaned_date). Next, I want to extract one column
with the month and another with the year, so I can do groupBys/aggregations
later. Currently, every projection/expression derived from cleaned_date has
to redo the expensive parsing, assuming I avoid caching and prefer to do
everything in one pass.
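
Here is a hedged sketch of that pipeline. The parse_strange_date UDF and the
'%d|%m|%Y' format are made up for illustration; month/year come from the
built-in functions:

    from datetime import datetime

    from pyspark.sql import functions as F
    from pyspark.sql.types import DateType

    # Hypothetical UDF that parses the strange string format into a real date.
    parse_strange_date = F.udf(
        lambda s: datetime.strptime(s, '%d|%m|%Y').date(), DateType())

    cleaned = df.withColumn('cleaned_date',
                            parse_strange_date(df['unclean_date']))
    result = (cleaned
              .withColumn('month', F.month(cleaned['cleaned_date']))
              .withColumn('year', F.year(cleaned['cleaned_date'])))
    # Without subexpression reuse, the expensive parse runs once per derived
    # column (here: month and year), since cleaned_date's defining expression
    # gets inlined into each projection.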
Code generation phase vs optimization

Is there a reason why doing it in the optimization phase is the wrong
approach? It sounds like my proposed approach actually changes the logical
order of computation. I do agree, however, that if there is lower-hanging
fruit, we should tackle that first =)
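
For reference, here is the plan shape I have in mind after such a rewrite,
sketched by hand from the explain output quoted below (this is not actual
optimizer output):

    > Project [date#56,id#57,timestamp#58,x#59,y#64,(y#64 * 3.0) AS z#65]
    >  Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64]
    >   PhysicalRDD [date#56,id#57,timestamp#58,x#59], ...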

On Sat, May 30, 2015 at 10:00 PM Michael Armbrust <mich...@databricks.com>
wrote:

> I think this is likely something that we'll want to do during the code
> generation phase. Though it's probably not the lowest-hanging fruit at this
> point.
>
> On Sun, May 31, 2015 at 5:02 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> I think you are looking for
>> http://en.wikipedia.org/wiki/Common_subexpression_elimination in the
>> optimizer.
>>
>> One thing to note is that as we do more and more optimizations like this,
>> the optimization time might increase. Do you see a case where this can
>> bring you substantial performance gains?
>>
>>
>> On Sat, May 30, 2015 at 9:02 AM, Justin Uang <justin.u...@gmail.com>
>> wrote:
>>
>>> On second thought, perhaps this can be done by writing a rule that
>>> builds the DAG of dependencies between expressions, then converts it into
>>> several layers of projections, where each new layer is allowed to depend
>>> on expression results from previous projections?
>>>
>>> Are there any pitfalls to this approach?
>>>
>>> On Sat, May 30, 2015 at 11:30 AM Justin Uang <justin.u...@gmail.com>
>>> wrote:
>>>
>>>> If I do the following
>>>>
>>>>     df2 = df.withColumn('y', df['x'] * 7)
>>>>     df3 = df2.withColumn('z', df2.y * 3)
>>>>     df3.explain()
>>>>
>>>> Then the result is
>>>>
>>>>     > Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65]
>>>>     >  PhysicalRDD [date#56,id#57,timestamp#58,x#59], MapPartitionsRDD[125] at mapPartitions at SQLContext.scala:1163
>>>>
>>>> Effectively I want to compute
>>>>
>>>>     y = f(x)
>>>>     z = g(y)
>>>>
>>>> The Catalyst optimizer realizes that y#64 is the same expression as the
>>>> one previously computed; however, when building the projection, it
>>>> ignores the fact that it has already computed y, so it calculates
>>>> `x * 7` twice.
>>>>
>>>>     y = x * 7
>>>>     z = x * 7 * 3
>>>>
>>>> If I wanted to make this fix, would it be possible to do the logic in
>>>> the optimizer phase? I imagine it's difficult because the expressions in
>>>> InterpretedMutableProjection don't have access to previous expression
>>>> results, only the input row, and the design doesn't seem to cater to
>>>> this.
>>>>
>>>>
>>
>
