On second thought, perhaps this can be done by writing a rule that builds
the DAG of dependencies between expressions, then converts it into several
layers of projections, where each new layer is allowed to depend on
expression results from the previous projections?

Are there any pitfalls to this approach?
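
To make that a bit more concrete, here is a toy sketch of the layering step
in plain Python (not Catalyst code; layer_expressions and the column names
are made up for illustration). Each derived expression gets a dependency
depth, and expressions at the same depth become one projection layer that
only references inputs or earlier layers:

    # Toy sketch only; this is not Spark/Catalyst code.
    def layer_expressions(deps):
        # deps maps each derived column to the columns it references.
        depth = {}

        def depth_of(name):
            if name not in deps:      # base input column
                return 0
            if name not in depth:
                depth[name] = 1 + max(
                    (depth_of(d) for d in deps[name]), default=0)
            return depth[name]

        layers = {}
        for name in deps:
            layers.setdefault(depth_of(name), []).append(name)
        return [layers[d] for d in sorted(layers)]

    # y = x * 7 depends on the input x; z = y * 3 depends on y.
    print(layer_expressions({"y": ["x"], "z": ["y"]}))
    # [['y'], ['z']] -- two layers, so y would only be computed once.
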
On Sat, May 30, 2015 at 11:30 AM Justin Uang <justin.u...@gmail.com> wrote:

> If I do the following
>
>     df2 = df.withColumn('y', df['x'] * 7)
>     df3 = df2.withColumn('z', df2.y * 3)
>     df3.explain()
>
> Then the result is
>
>     > Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59 * 7.0) AS y#64 * 3.0) AS z#65]
>     >  PhysicalRDD [date#56,id#57,timestamp#58,x#59], MapPartitionsRDD[125] at mapPartitions at SQLContext.scala:1163
>
> Effectively I want to compute
>
>     y = f(x)
>     z = g(y)
>
> The Catalyst optimizer realizes that y#64 is the same expression as the
> one previously computed; however, when building the projection, it ignores
> the fact that it has already computed y, so it calculates `x * 7` twice.
>
>     y = x * 7
>     z = x * 7 * 3
>
> If I wanted to make this fix, would it be possible to do the logic in the
> optimizer phase? I imagine that it's difficult because the expressions in
> InterpretedMutableProjection only have access to the input row, not to the
> previously computed expression results, and the design doesn't seem to
> cater to this.
>
>
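
To restate the duplication from the quoted plan as a standalone sketch
(again plain Python rather than Spark code; f here just stands in for the
x * 7 expression), a flat projection evaluates the shared subexpression
once per use, while the layered version reuses the result from the
previous layer:

    # Toy sketch only; not Spark code.
    calls = 0

    def f(x):                # stands in for the `x * 7` expression
        global calls
        calls += 1
        return x * 7

    row = {"x": 2.0}

    # Flat projection: z's expression has f(x) inlined, so f runs twice.
    flat = {"y": f(row["x"]), "z": f(row["x"]) * 3}
    print(calls)             # 2

    # Layered projections: the second layer reuses y from the first.
    calls = 0
    layer1 = dict(row, y=f(row["x"]))
    layer2 = dict(layer1, z=layer1["y"] * 3)
    print(calls)             # 1
    print(layer2["z"])       # 42.0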
