I think this is likely something that we'll want to do during the code
generation phase.  Though its probably not the lowest hanging fruit at this
point.

On Sun, May 31, 2015 at 5:02 AM, Reynold Xin <r...@databricks.com> wrote:

> I think you are looking for
> http://en.wikipedia.org/wiki/Common_subexpression_elimination in the
> optimizer.
>
> One thing to note is that as we do more and more optimization like this,
> the optimization time might increase. Do you see a case where this can
> bring you substantial performance gains?
>
>
> On Sat, May 30, 2015 at 9:02 AM, Justin Uang <justin.u...@gmail.com>
> wrote:
>
>> On second thought, perhaps can this be done by writing a rule that builds
>> the dag of dependencies between expressions, then convert it into several
>> layers of projections, where each new layer is allowed to depend on
>> expression results from previous projections?
>>
>> Are there any pitfalls to this approach?
>>
>> On Sat, May 30, 2015 at 11:30 AM Justin Uang <justin.u...@gmail.com>
>> wrote:
>>
>>> If I do the following
>>>
>>>     df2 = df.withColumn('y', df['x'] * 7)
>>>     df3 = df2.withColumn('z', df2.y * 3)
>>>     df3.explain()
>>>
>>> Then the result is
>>>
>>>     > Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS
>>> y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65]
>>>     >  PhysicalRDD [date#56,id#57,timestamp#58,x#59],
>>> MapPartitionsRDD[125] at mapPartitions at SQLContext.scala:1163
>>>
>>> Effectively I want to compute
>>>
>>>     y = f(x)
>>>     z = g(y)
>>>
>>> The catalyst optimizer realizes that y#64 is the same as the one
>>> previously computed, however, when building the projection, it is ignoring
>>> the fact that it had already computed y, so it calculates `x * 7` twice.
>>>
>>>     y = x * 7
>>>     z = x * 7 * 3
>>>
>>> If I wanted to make this fix, would it be possible to do the logic in
>>> the optimizer phase? I imagine that it's difficult because the expressions
>>> in InterpretedMutableProjection don't have access to the previous
>>> expression results, only the input row, and that the design doesn't seem to
>>> be catered for this.
>>>
>>>
>

Reply via email to