Function1 is specialized, but nullSafeEval is Any => Any, so that's still going to box in the non-codegened execution path.
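
A minimal sketch of what I mean, outside of Spark. `evalInterpreted` and `evalPrimitive` are hypothetical stand-ins (not Catalyst APIs) for `nullSafeEval` and for the `function` method in your version. The `Double => Double` call itself can use the specialized `apply`, but the `Any` in the signature forces a `java.lang.Double` on the way in and on the way out:

```scala
object BoxingSketch {
  // Hypothetical stand-in for the interpreted path: like nullSafeEval,
  // it accepts Any and returns Any.
  def evalInterpreted(f: Double => Double)(input: Any): Any =
    // The cast unboxes the input; returning the Double as Any boxes the result.
    f(input.asInstanceOf[Double])

  // Hypothetical stand-in for a primitive-only entry point, like `function`
  // in the quoted class: no Any in the signature, so no boxing is required here.
  def evalPrimitive(f: Double => Double)(input: Double): Double = f(input)

  def main(args: Array[String]): Unit = {
    val sqrt: Double => Double = (d: Double) => math.sqrt(d)

    val boxed: Any = evalInterpreted(sqrt)(4.0) // 4.0 is boxed to be passed as Any
    val raw: Double = evalPrimitive(sqrt)(4.0)  // stays a primitive double

    // At runtime the interpreted result is a boxed java.lang.Double.
    println(boxed.isInstanceOf[java.lang.Double])
    println(raw)
  }
}
```

Whether the JIT later removes some of that boxing is a separate question that this sketch does not measure.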

On Fri, Feb 10, 2017 at 1:32 PM, Koert Kuipers <ko...@tresata.com> wrote:

> based on that i take it that math functions would be primary beneficiaries
> since they work on primitives.
>
> so if i take UnaryMathExpression as an example, would i not get the same
> benefit if i change it to this?
>
> abstract class UnaryMathExpression(val f: Double => Double, name: String)
>   extends UnaryExpression with Serializable with ImplicitCastInputTypes {
>
>   override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
>   override def dataType: DataType = DoubleType
>   override def nullable: Boolean = true
>   override def toString: String = s"$name($child)"
>   override def prettyName: String = name
>
>   protected override def nullSafeEval(input: Any): Any = {
>     f(input.asInstanceOf[Double])
>   }
>
>   // name of function in java.lang.Math
>   def funcName: String = name.toLowerCase
>
>   def function(d: Double): Double = f(d)
>
>   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
>     val self = ctx.addReferenceObj(name, this, getClass.getName)
>     defineCodeGen(ctx, ev, c => s"$self.function($c)")
>   }
> }
>
> admittedly in this case the benefit in terms of removing complex codegen
> is not there (the codegen was only one line), but if i can remove codegen
> here i could also remove it in lots of other places where it does get very
> unwieldy simply by replacing it with calls to methods.
>
> Function1 is specialized, so i think (or hope) that my version does no
> extra boxing/unboxing.
>
> On Fri, Feb 10, 2017 at 2:24 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> With complex types it doesn't work as well, but for primitive types the
>> biggest benefit of whole stage codegen is that we don't even need to put
>> the intermediate data into rows or columns anymore. They are just variables
>> (stored in CPU registers).
>>
>> On Fri, Feb 10, 2017 at 8:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> so i have been looking for a while now at all the catalyst expressions,
>>> and all the relatively complex codegen going on.
>>>
>>> so first off i get the benefit of codegen to turn a bunch of chained
>>> iterator transformations into a single codegen stage for spark. that makes
>>> sense to me, because it avoids a bunch of overhead.
>>>
>>> but what i am not so sure about is what the benefit is of converting the
>>> actual stuff that happens inside the iterator transformations into codegen.
>>>
>>> say if we have an expression that has 2 children and creates a struct
>>> for them. why would this be faster in codegen by re-creating the code to do
>>> this in a string (which is complex and error prone) compared to simply having
>>> the codegen call the normal method for this in my class?
>>>
>>> i see so much trivial code being re-created in codegen. stuff like this:
>>>
>>> private[this] def castToDateCode(
>>>     from: DataType,
>>>     ctx: CodegenContext): CastFunction = from match {
>>>   case StringType =>
>>>     val intOpt = ctx.freshName("intOpt")
>>>     (c, evPrim, evNull) => s"""
>>>       scala.Option<Integer> $intOpt =
>>>         org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToDate($c);
>>>       if ($intOpt.isDefined()) {
>>>         $evPrim = ((Integer) $intOpt.get()).intValue();
>>>       } else {
>>>         $evNull = true;
>>>       }
>>>     """
>>>
>>> is this really faster than simply calling an equivalent function from
>>> the codegen, and keeping the codegen logic restricted to the "unrolling" of
>>> chained iterators?