so i have been looking for a while now at all the catalyst expressions, and
all the relatively complex codegen going on.
so first off i get the benefit of codegen to turn a bunch of chained
iterator transformations into a single codegen stage for spark. that makes
sense to me, because it avoids a bunch of overhead.
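
to show what i mean by that part, here is a rough java sketch (not actual spark output; the class and method names are made up for illustration). the first version chains iterator wrappers, so every row goes through several hasNext()/next() calls and gets boxed. the second is roughly the shape i would expect from a single fused stage: one loop over primitives.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// hypothetical illustration, not spark code
final class FusionSketch {
    // operator-at-a-time: each step wraps the previous iterator
    static Iterator<Integer> addOne(Iterator<Integer> in) {
        return new Iterator<Integer>() {
            public boolean hasNext() { return in.hasNext(); }
            public Integer next() { return in.next() + 1; }
        };
    }

    static Iterator<Integer> keepEven(Iterator<Integer> in) {
        return new Iterator<Integer>() {
            private Integer buffered = null;

            public boolean hasNext() {
                // pull from the child until an even value shows up
                while (buffered == null && in.hasNext()) {
                    Integer v = in.next();
                    if (v % 2 == 0) buffered = v;
                }
                return buffered != null;
            }

            public Integer next() {
                if (!hasNext()) throw new NoSuchElementException();
                Integer v = buffered;
                buffered = null;
                return v;
            }
        };
    }

    // chained version: map, then filter, then sum
    static long sumChained(List<Integer> rows) {
        Iterator<Integer> it = keepEven(addOne(rows.iterator()));
        long sum = 0;
        while (it.hasNext()) sum += it.next();
        return sum;
    }

    // fused version: the same three steps collapsed into one loop
    static long sumFused(int[] rows) {
        long sum = 0;
        for (int row : rows) {
            int v = row + 1;
            if (v % 2 == 0) sum += v;
        }
        return sum;
    }
}
```

both versions compute the same result; the only difference is how many calls and objects sit between one row and the next.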
but what i am not so sure about is what the benefit is of converting the
actual stuff that happens inside the iterator transformations into codegen.
say if we have an expression that has 2 children and creates a struct for
them. why would this be faster in codegen by re-creating the code to do
this in a string (which is complex and error prone) compared to simply having
the codegen call the normal method for this in my class?
i see so much trivial code being re-created in codegen. stuff like this:
private[this] def castToDateCode(
    from: DataType,
    ctx: CodegenContext): CastFunction = from match {
  case StringType =>
    val intOpt = ctx.freshName("intOpt")
    (c, evPrim, evNull) => s"""
      scala.Option<Integer> $intOpt =
        org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToDate($c);
      if ($intOpt.isDefined()) {
        $evPrim = ((Integer) $intOpt.get()).intValue();
      } else {
        $evNull = true;
      }
    """
is this really faster than simply calling an equivalent function from the
codegen, and keeping the codegen logic restricted to the "unrolling" of
chained iterators?
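
to make the question concrete, here is a rough java sketch of the two shapes i am comparing. everything in it is hypothetical: stringToDays is a toy stand-in for DateTimeUtils.stringToDate (it only parses a plain integer day count and assumes non-null input), and evalCast stands for "the normal method in my class". variant 1 spells out the glue code the way the generated string above does, variant 2 only calls the method and unpacks the boxed result.

```java
import java.util.Optional;

// hypothetical illustration, not spark code
final class CastToDateSketch {
    // toy stand-in for DateTimeUtils.stringToDate, NOT the real method
    static Optional<Integer> stringToDays(String s) {
        try {
            return Optional.of(Integer.parseInt(s.trim()));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    // result slot in the style of generated code: a primitive plus a null flag
    static final class Result {
        int value;
        boolean isNull;
    }

    // variant 1: the glue logic is written out in the generated java
    static Result inlined(String c) {
        Result ev = new Result();
        Optional<Integer> intOpt = stringToDays(c);
        if (intOpt.isPresent()) {
            ev.value = intOpt.get().intValue();
        } else {
            ev.isNull = true;
        }
        return ev;
    }

    // hypothetical ordinary method on the expression class:
    // Object in, Object out, null meaning sql null
    static Object evalCast(Object input) {
        return stringToDays((String) input).orElse(null);
    }

    // variant 2: the generated java only calls the method and unpacks the result
    static Result delegating(String c) {
        Result ev = new Result();
        Object boxed = evalCast(c);
        if (boxed != null) {
            ev.value = ((Integer) boxed).intValue();
        } else {
            ev.isNull = true;
        }
        return ev;
    }
}
```

both variants give the same answer for the same input. in this sketch the only differences are where the logic lives and that variant 2 passes the value around as an Object, so my question is whether that difference is big enough to justify writing the logic as a string.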