so i have been looking for a while now at all the catalyst expressions, and
all the relative complex codegen going on.

so first off i get the benefit of codegen to turn a bunch of chained
iterators transformations into a single codegen stage for spark. that makes
sense to me, because it avoids a bunch of overhead.

but what i am not so sure about is what the benefit is of converting the
actual stuff that happens inside the iterator transformations into codegen.

say if we have an expression that has 2 children and creates a struct for
them. why would this be faster in codegen by re-creating the code to do
this in a string (which is complex and error prone) compared to simply have
the codegen call the normal method for this in my class?

i see so much trivial code be re-created in codegen. stuff like this:

  private[this] def castToDateCode(
      from: DataType,
      ctx: CodegenContext): CastFunction = from match {
    case StringType =>
      val intOpt = ctx.freshName("intOpt")
      (c, evPrim, evNull) => s"""
        scala.Option<Integer> $intOpt =
          org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToDate($c);
        if ($intOpt.isDefined()) {
          $evPrim = ((Integer) $intOpt.get()).intValue();
        } else {
          $evNull = true;
        }
       """

is this really faster than simply calling an equivalent functions from the
codegen, and keeping the codegen logic restricted to the "unrolling" of
chained iterators?

Reply via email to