If I do the following

    df2 = df.withColumn('y', df['x'] * 7)
    df3 = df2.withColumn('z', df2.y * 3)
    df3.explain()

Then the result is

    > Project [date#56,id#57,timestamp#58,x#59,(x#59 * 7.0) AS y#64,((x#59
* 7.0) AS y#64 * 3.0) AS z#65]
    >  PhysicalRDD [date#56,id#57,timestamp#58,x#59], MapPartitionsRDD[125]
at mapPartitions at SQLContext.scala:1163

Effectively I want to compute

    y = f(x)
    z = g(y)

The catalyst optimizer realizes that y#64 is the same as the one previously
computed, however, when building the projection, it is ignoring the fact
that it had already computed y, so it calculates `x * 7` twice.

    y = x * 7
    z = x * 7 * 3

If I wanted to make this fix, would it be possible to do the logic in the
optimizer phase? I imagine that it's difficult because the expressions in
InterpretedMutableProjection don't have access to the previous expression
results, only the input row, and that the design doesn't seem to be catered
for this.

Reply via email to