thanks for that detailed response!

On Mon, Feb 13, 2017 at 12:56 AM, Sumedh Wale <sw...@snappydata.io> wrote:
> The difference is closure invocation instead of a static java.lang.Math
> call. In many cases the JIT may not be able to perform inlining and related
> code optimizations, though in this specific case it should. This is highly
> dependent on the specific case, but when inlining cannot be done and it
> leads to a method call (especially a virtual call) then the difference is
> quite large: a few nanoseconds per evaluation vs tens of nanoseconds in my
> experiments.
> Serialization of an additional object as a reference can have a measurable
> effect for low-latency jobs, though it can usually be ignored.
>
> What has been observed is that if an expression uses CodegenFallback then
> it becomes an order of magnitude slower or more. Most of that is due to
> UnsafeRow read/write overhead, which is avoided here, but care still needs
> to be taken with (virtual) function calls too. In some cases the JIT does
> inline virtual calls, but that may not always happen. In my experience the
> only reliable case where it does inline is when the virtual call is on a
> local variable that does not change across invocations (e.g. a final local
> variable outside the while loop of a doProduce).
>
> I think what should work better is encapsulating such code in methods of a
> scala object rather than a class, since those can be invoked in generated
> code like static methods. Such calls should be equivalent to inline code
> generation in most cases since the JIT will inline the calls where it
> determines significant benefit. In some cases such method calls will have
> better CPU instruction cache hits (i.e. the same inline code emitted
> multiple times vs a common method call). All this needs thorough
> micro/macro-benchmarking.
>
> However, I don't recall any large pieces of generated code where this can
> help. Most complex pieces, like those in
> HashAggregateExec/SortMergeJoinExec/BroadcastHashJoinExec,
> are so because they generate schema-specific code (to avoid virtual calls
> and boxing/unboxing, and UnsafeRow read/write in some cases) which is
> significantly faster than the equivalent generic code in doExecute. Or in
> your "castToDateCode" example, I don't see how you can reduce it since the
> bulk of the code is already in the static stringToDate call.
>
> On Saturday 11 February 2017 03:02 AM, Koert Kuipers wrote:
>
> based on that i take it that math functions would be the primary
> beneficiaries since they work on primitives.
>
> so if i take UnaryMathExpression as an example, would i not get the same
> benefit if i change it to this?
> abstract class UnaryMathExpression(val f: Double => Double, name: String)
>   extends UnaryExpression with Serializable with ImplicitCastInputTypes {
>
>   override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
>   override def dataType: DataType = DoubleType
>   override def nullable: Boolean = true
>   override def toString: String = s"$name($child)"
>   override def prettyName: String = name
>
>   protected override def nullSafeEval(input: Any): Any = {
>     f(input.asInstanceOf[Double])
>   }
>
>   // name of function in java.lang.Math
>   def funcName: String = name.toLowerCase
>
>   def function(d: Double): Double = f(d)
>
>   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
>     val self = ctx.addReferenceObj(name, this, getClass.getName)
>     defineCodeGen(ctx, ev, c => s"$self.function($c)")
>   }
> }
>
> admittedly in this case the benefit in terms of removing complex codegen
> is not there (the codegen was only one line), but if i can remove codegen
> here i could also remove it in lots of other places where it does get very
> unwieldy, simply by replacing it with calls to methods.
>
> Function1 is specialized, so i think (or hope) that my version does no
> extra boxing/unboxing.
>
> On Fri, Feb 10, 2017 at 2:24 PM, Reynold Xin <r...@databricks.com> wrote:
>
>> With complex types it doesn't work as well, but for primitive types the
>> biggest benefit of whole-stage codegen is that we don't even need to put
>> the intermediate data into rows or columns anymore. They are just
>> variables (stored in CPU registers).
>>
>> On Fri, Feb 10, 2017 at 8:22 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> so i have been looking for a while now at all the catalyst expressions,
>>> and all the relatively complex codegen going on.
>>>
>>> so first off i get the benefit of codegen to turn a bunch of chained
>>> iterator transformations into a single codegen stage for spark. that
>>> makes sense to me, because it avoids a bunch of overhead.
>>>
>>> but what i am not so sure about is what the benefit is of converting
>>> the actual stuff that happens inside the iterator transformations into
>>> codegen.
>>>
>>> say if we have an expression that has 2 children and creates a struct
>>> for them. why would this be faster in codegen, by re-creating the code
>>> to do this in a string (which is complex and error prone), compared to
>>> simply having the codegen call the normal method for this in my class?
>>>
>>> i see so much trivial code being re-created in codegen. stuff like this:
>>>
>>> private[this] def castToDateCode(
>>>     from: DataType,
>>>     ctx: CodegenContext): CastFunction = from match {
>>>   case StringType =>
>>>     val intOpt = ctx.freshName("intOpt")
>>>     (c, evPrim, evNull) => s"""
>>>       scala.Option<Integer> $intOpt =
>>>         org.apache.spark.sql.catalyst.util.DateTimeUtils.stringToDate($c);
>>>       if ($intOpt.isDefined()) {
>>>         $evPrim = ((Integer) $intOpt.get()).intValue();
>>>       } else {
>>>         $evNull = true;
>>>       }
>>>     """
>>>
>>> is this really faster than simply calling an equivalent function from
>>> the codegen, and keeping the codegen logic restricted to the "unrolling"
>>> of chained iterators?
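For concreteness, a rough sketch of the scala-object approach described above. The names (com.example, MathStatics, Log1pExpr) are made up for illustration, and this is not how Spark's own math expressions are implemented; the point is only that a method on a top-level scala object compiles to a static forwarder, so the generated java can call it without shipping `this` as a reference object:

package com.example

import org.apache.spark.sql.catalyst.expressions.{Expression, ImplicitCastInputTypes, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.types.{AbstractDataType, DataType, DoubleType}

// hypothetical helper object; a top-level object gets a static forwarder,
// so generated java source can call com.example.MathStatics.log1p(x) directly
object MathStatics {
  def log1p(d: Double): Double = java.lang.Math.log1p(d)
}

// hypothetical expression that uses the same helper for interpreted eval and codegen
case class Log1pExpr(child: Expression)
    extends UnaryExpression with ImplicitCastInputTypes {

  override def inputTypes: Seq[AbstractDataType] = Seq(DoubleType)
  override def dataType: DataType = DoubleType

  protected override def nullSafeEval(input: Any): Any =
    MathStatics.log1p(input.asInstanceOf[Double])

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    // no ctx.addReferenceObj and no virtual call on a serialized object:
    // the emitted java is a plain static-style invocation the JIT can inline
    defineCodeGen(ctx, ev, c => s"com.example.MathStatics.log1p($c)")
}

Whether this really matches fully inlined codegen is exactly the kind of question the micro/macro-benchmarking mentioned above would have to answer.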
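And a crude (definitely not JMH-quality) sketch of the sort of micro-benchmark that could separate a static java.lang.Math call from a call through a specialized Function1 field, which is roughly the difference between inline codegen and the `$self.function($c)` variant. All names here are made up, and the JIT may well inline both paths since the call site is monomorphic, so take the numbers with a grain of salt:

object ClosureVsStaticBench {
  // the function held as a Double => Double closure in a field (specialized Function1)
  val f: Double => Double = d => java.lang.Math.sqrt(d)

  def viaStatic(n: Int): Double = {
    var acc = 0.0; var i = 0
    while (i < n) { acc += java.lang.Math.sqrt(i.toDouble); i += 1 } // direct static call
    acc
  }

  def viaClosure(n: Int): Double = {
    var acc = 0.0; var i = 0
    while (i < n) { acc += f(i.toDouble); i += 1 } // Function1.apply, potentially a virtual call
    acc
  }

  def main(args: Array[String]): Unit = {
    val n = 50000000
    for (_ <- 1 to 5) { // repeat so the JIT gets a chance to compile and inline
      var t = System.nanoTime(); val a = viaStatic(n)
      println(s"static : ${(System.nanoTime() - t) / 1e6} ms ($a)")
      t = System.nanoTime(); val b = viaClosure(n)
      println(s"closure: ${(System.nanoTime() - t) / 1e6} ms ($b)")
    }
  }
}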