An encoder uses reflection
<https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala>
to generate expressions that can extract data out of an object (by calling
methods on the object) and encode its contents directly into the Tungsten
binary row format (and vice versa).  We generate bytecode that evaluates
these expressions in the same way that we generate code for normal
expression evaluation in query processing.  However, this reflection only
works for simple ADTs
<https://en.wikipedia.org/wiki/Algebraic_data_type>.  Another key thing to
realize is that we do this reflection / code generation at runtime, so we
aren't constrained by binary compatibility across versions of Spark.
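To make that concrete, here is a minimal sketch of the encoder path. The
case class and names are illustrative, but the mechanism is the one
described above: for a product type, an Encoder is derived by reflection
at runtime, and its serializer expressions call the field accessors and
write straight into Tungsten's UnsafeRow format.

```scala
import org.apache.spark.sql.SparkSession

// A simple product type -- an "ADT" in the sense above. Reflection can
// derive an Encoder[Point] for it automatically; no UDT needed.
case class Point(x: Double, y: Double)

object EncoderSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("encoder-sketch")
      .getOrCreate()
    import spark.implicits._

    // toDS() picks up the implicit Encoder[Point] built via
    // ScalaReflection. Its generated code calls p.x / p.y and encodes
    // the values directly into the Tungsten binary row.
    val ds = Seq(Point(1.0, 2.0), Point(3.0, 4.0)).toDS()

    // The schema reflection produced from the case class:
    ds.printSchema()

    spark.stop()
  }
}
```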

UDTs let you write custom code that translates an object into a
generic row, which we can then translate into Spark's internal format
(using a RowEncoder). Unlike expressions and Tungsten binary encoding, the
Row type that you return here is a stable public API that hasn't changed
since Spark 1.3.
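For comparison, here is a sketch of the UDT path, modeled on the
ExamplePointUDT in Spark's test sources: you describe your class in terms
of a catalyst SQL type and write the serialize/deserialize translation
yourself. Note that the visibility of the UserDefinedType API has varied
across Spark versions, so treat this as a sketch rather than a stable
recipe.

```scala
import org.apache.spark.sql.catalyst.util.{ArrayData, GenericArrayData}
import org.apache.spark.sql.types._

// A plain JVM class that reflection-based encoders can't handle.
class ExamplePoint(val x: Double, val y: Double)

// The UDT maps ExamplePoint to and from a type catalyst understands
// (here, an array of doubles).
class ExamplePointUDT extends UserDefinedType[ExamplePoint] {

  // How the data is represented in Spark SQL's type system.
  override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

  // Object -> catalyst representation.
  override def serialize(p: ExamplePoint): GenericArrayData =
    new GenericArrayData(Array(p.x, p.y))

  // Catalyst representation -> object.
  override def deserialize(datum: Any): ExamplePoint = datum match {
    case values: ArrayData =>
      new ExamplePoint(values.getDouble(0), values.getDouble(1))
  }

  override def userClass: Class[ExamplePoint] = classOf[ExamplePoint]
}
```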

So to summarize, if encoders don't work for your specific types you can use
UDTs, but they probably won't be as efficient. I'd love to unify these code
paths more, but it's actually a fair amount of work to come up with a good,
stable public API that doesn't sacrifice performance.

On Tue, Dec 27, 2016 at 6:32 AM, dragonly <liyilon...@gmail.com> wrote:

> I'm recently reading the source code of the SparkSQL project, and found
> some interesting Databricks blogs about the Tungsten project. I've
> roughly read through the encoder and unsafe-representation parts of the
> Tungsten project (I haven't read the algorithm parts, such as the
> cache-friendly hashmap algorithms).
> Now there's a big puzzle in front of me about the codegen of SparkSQL and
> how the codegen utilizes the Tungsten encoding between JVM objects and
> unsafe bits.
> So can anyone tell me what's the main difference between situations where
> I write a UDT like ExamplePointUDT in SparkSQL and where I just create an
> ArrayType which can be handled by the Tungsten encoder? I'll really
> appreciate it if you can go through some concrete code examples. Thanks a
> lot!
>
> --
> View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/What-is-mainly-different-from-a-UDT-and-a-spark-internal-type-that-ExpressionEncoder-recognized-tp20370.html
> Sent from the Apache Spark Developers List mailing list archive at
> Nabble.com.
>
