based on that i take it that math functions would be the primary beneficiaries, since they work on primitives.
so if i take UnaryMathExpression as an example, would i not get the same benefit if i change it to this?

abstract class UnaryMathExpression(val f: Double => Double, name: String)
  extends UnaryExpression with Serializable with ImplicitCastInputTypes {

  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
  override def dataType: DataType = DoubleType
  override def nullable: Boolean = true
  override def toString: String = s"$name($child)"
  override def prettyName: String = name

  protected override def nullSafeEval(input: Any): Any = {
    f(input.asInstanceOf[Double])
  }

  // name of function in java.lang.Math
  def funcName: String = name.toLowerCase

  def function(d: Double): Double = f(d)

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    val self = ctx.addReferenceObj(name, this, getClass.getName)
    defineCodeGen(ctx, ev, c => s"$self.function($c)")
  }
}

admittedly in this case the benefit in terms of removing complex codegen is not there (the codegen was only one line), but if i can remove codegen here i could also remove it in lots of other places where it does get very unwieldy, simply by replacing it with calls to methods.

Function1 is specialized, so i think (or hope) that my version does no extra boxing/unboxing.

On Fri, Feb 10, 2017 at 2:24 PM, Reynold Xin <r...@databricks.com> wrote:

> With complex types it doesn't work as well, but for primitive types the
> biggest benefit of whole stage codegen is that we don't even need to put
> the intermediate data into rows or columns anymore. They are just variables
> (stored in CPU registers).
>
> On Fri, Feb 10, 2017 at 8:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> so i have been looking for a while now at all the catalyst expressions,
>> and all the relatively complex codegen going on.
>>
>> so first off i get the benefit of codegen to turn a bunch of chained
>> iterator transformations into a single codegen stage for spark. that makes
>> sense to me, because it avoids a bunch of overhead.
>>
>> but what i am not so sure about is what the benefit is of converting the
>> actual stuff that happens inside the iterator transformations into codegen.
>>
>> say if we have an expression that has 2 children and creates a struct for
>> them. why would this be faster in codegen by re-creating the code to do
>> this in a string (which is complex and error prone) compared to simply having
>> the codegen call the normal method for this in my class?
>>
>> i see so much trivial code being re-created in codegen. stuff like this:
>>
>>   private[this] def castToDateCode(
>>       from: DataType,
>>       ctx: CodegenContext): CastFunction = from match {
>>     case StringType =>
>>       val intOpt = ctx.freshName("intOpt")
>>       (c, evPrim, evNull) => s"""
>>         scala.Option<Integer> $intOpt =
>>           org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToDate($c);
>>         if ($intOpt.isDefined()) {
>>           $evPrim = ((Integer) $intOpt.get()).intValue();
>>         } else {
>>           $evNull = true;
>>         }
>>        """
>>
>> is this really faster than simply calling an equivalent function from
>> the codegen, and keeping the codegen logic restricted to the "unrolling" of
>> chained iterators?
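
for reference, a spark-free sketch of the two shapes of generated code being compared at the top of this message: one where the generated java source names the java.lang.Math function directly, and one where it calls back into a referenced object. MathFn, CodegenShapes, inlined, viaReference and the variable name ref0 are all hypothetical stand-ins made up for illustration; none of them are spark classes or methods. it only shows what the generated strings look like and says nothing about which one is faster.

```scala
// Hypothetical stand-in for a math expression; not one of Spark's classes.
final class MathFn(val name: String, f: Double => Double) extends Serializable {
  // Parameter and result are both statically Double, so the compiler can
  // target the specialized Function1 entry point (apply$mcDD$sp) here.
  // Assumption: Scala 2.x behaviour; worth confirming with javap -c.
  def function(d: Double): Double = f(d)
}

object CodegenShapes {
  // Shape 1: the generated Java source names java.lang.Math directly.
  def inlined(fn: MathFn, input: String): String =
    s"java.lang.Math.${fn.name.toLowerCase}($input)"

  // Shape 2: the generated Java source calls back into a referenced object.
  // `ref` stands for whatever variable name the code generator handed out.
  def viaReference(ref: String, input: String): String =
    s"$ref.function($input)"

  def main(args: Array[String]): Unit = {
    val sqrt = new MathFn("SQRT", (d: Double) => math.sqrt(d))
    println(inlined(sqrt, "value"))        // snippet in the inlined shape
    println(viaReference("ref0", "value")) // snippet in the call-back shape
    println(sqrt.function(4.0))            // direct call through Function1
  }
}
```

whether the second shape really avoids boxing and gets inlined by the jit in a real query would have to be checked on the actual generated code (for example by inspecting the bytecode or benchmarking); the sketch does not show that.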