based on that i take it that math functions would be the primary
beneficiaries, since they work on primitives.
so if i take UnaryMathExpression as an example, would i not get the same
benefit if i change it to this?
abstract class UnaryMathExpression(val f: Double => Double, name: String)
  extends UnaryExpression with Serializable with ImplicitCastInputTypes {

  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
  override def dataType: DataType = DoubleType
  override def nullable: Boolean = true
  override def toString: String = s"$name($child)"
  override def prettyName: String = name

  protected override def nullSafeEval(input: Any): Any = {
    f(input.asInstanceOf[Double])
  }

  // name of function in java.lang.Math
  def funcName: String = name.toLowerCase

  def function(d: Double): Double = f(d)

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    val self = ctx.addReferenceObj(name, this, getClass.getName)
    defineCodeGen(ctx, ev, c => s"$self.function($c)")
  }
}
admittedly in this case the benefit in terms of removing complex codegen is
not there (the codegen was only one line), but if i can remove codegen here
i could also remove it in lots of other places where it does get very
unwieldy simply by replacing it with calls to methods.
Function1 is specialized for Double => Double, so i think (or hope) that my
version does no extra boxing/unboxing.
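
To make the boxing question concrete, here is a minimal sketch of the same pattern with no Spark dependency. MathFn, MathFnDemo and eval are hypothetical names made up for illustration, not Spark APIs. The assumption being illustrated is that a call through function(d: Double): Double stays on primitives as long as the compiler dispatches f(d) to the specialized variant of Function1, while the Any-typed path always boxes. Whether that holds for a given build is best confirmed by inspecting the bytecode (for example with javap -c) rather than taken from this sketch.

```scala
// Hypothetical stand-in for the proposed UnaryMathExpression (not a Spark class).
final class MathFn(val f: Double => Double, val name: String) extends Serializable {
  // Primitive in, primitive out: this is the method generated code would call
  // in place of an inlined java.lang.Math expression.
  def function(d: Double): Double = f(d)

  // Untyped path, analogous to nullSafeEval(input: Any): Any.
  // The argument and the result are boxed here regardless of specialization.
  def eval(input: Any): Any = f(input.asInstanceOf[Double])
}

object MathFnDemo {
  // Explicit lambdas so the expected type Double => Double is unambiguous.
  val sqrt = new MathFn(d => math.sqrt(d), "SQRT")
  val log1p = new MathFn(d => math.log1p(d), "LOG1P")

  def main(args: Array[String]): Unit = {
    println(sqrt.function(16.0)) // primitive path
    println(log1p.eval(0.0))     // boxed path
  }
}
```

If the bytecode of function shows a call to the generic apply(Object) instead of the Double-specialized method, the version above would box on every row and the comparison with inlined codegen would no longer be like for like.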
On Fri, Feb 10, 2017 at 2:24 PM, Reynold Xin <[email protected]> wrote:
> With complex types it doesn't work as well, but for primitive types the
> biggest benefit of whole stage codegen is that we don't even need to put
> the intermediate data into rows or columns anymore. They are just variables
> (stored in CPU registers).
>
> On Fri, Feb 10, 2017 at 8:22 PM, Koert Kuipers <[email protected]> wrote:
>
>> so i have been looking for a while now at all the catalyst expressions,
>> and all the relatively complex codegen going on.
>>
>> so first off i get the benefit of codegen to turn a bunch of chained
>> iterator transformations into a single codegen stage for spark. that makes
>> sense to me, because it avoids a bunch of overhead.
>>
>> but what i am not so sure about is what the benefit is of converting the
>> actual stuff that happens inside the iterator transformations into codegen.
>>
>> say we have an expression that has 2 children and creates a struct for
>> them. why would this be faster in codegen by re-creating the code to do
>> this in a string (which is complex and error prone) compared to simply
>> having the codegen call the normal method for this in my class?
>>
>> i see so much trivial code being re-created in codegen. stuff like this:
>>
>> private[this] def castToDateCode(
>>     from: DataType,
>>     ctx: CodegenContext): CastFunction = from match {
>>   case StringType =>
>>     val intOpt = ctx.freshName("intOpt")
>>     (c, evPrim, evNull) => s"""
>>       scala.Option<Integer> $intOpt =
>>         org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToDate($c);
>>       if ($intOpt.isDefined()) {
>>         $evPrim = ((Integer) $intOpt.get()).intValue();
>>       } else {
>>         $evNull = true;
>>       }
>>     """
>>
>> is this really faster than simply calling an equivalent function from
>> the codegen, and keeping the codegen logic restricted to the "unrolling" of
>> chained iterators?
>>
>>
>