[
https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15222611#comment-15222611
]
Josh Rosen commented on SPARK-14083:
------------------------------------
Null handling is going to present a major design challenge here. If we want to
exactly preserve the behavior of the Java closure, then we need to ensure that
our translation does not add implicit null handling that differs from the
JVM's own handling. For example:
- What happens today if a user calls .getInt(..) on a column which is null? We
need to preserve the current behavior.
- What if a user calls .getString(...).equals("foo") on a row where the string
column is null? Today the user's code will throw a NullPointerException. To
preserve this behavior, we might need to add an expression which throws
exceptions on null values.
- In Scala, casting null to a numeric type (e.g. null.asInstanceOf[Int]) returns
the zero value of that type (in Java, unboxing a null throws a
NullPointerException), whereas SQL casts preserve nulls.
I think we have no choice but to faithfully preserve Java's null semantics.
If we didn't, then subtle differences in Java closures or in compilers'
emitted bytecode could alter the results of queries.
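
To make the .equals("foo") case above concrete, here is a minimal sketch in
plain Java with no Spark dependency. Every name in it (NullSemanticsSketch,
sqlEqualsFoo, assertNotNull, FAITHFUL) is hypothetical and only illustrates the
idea of an expression that throws on null input; it is not Catalyst's API or
a proposed implementation.

```java
import java.util.function.Function;

// Hypothetical sketch: none of these names are Spark or Catalyst APIs.
final class NullSemanticsSketch {

    // A naive translation wrapped so that it fails on null like the closure does.
    static final Function<String, Boolean> FAITHFUL =
        assertNotNull(NullSemanticsSketch::sqlEqualsFoo);

    // The user's closure, written against a nullable String column.
    static boolean equalsFoo(String name) {
        return name.equals("foo"); // throws NullPointerException when name is null
    }

    // A naive SQL-style translation: null in, null out.
    static Boolean sqlEqualsFoo(String name) {
        return name == null ? null : name.equals("foo");
    }

    // Null-asserting wrapper: throw on null input, otherwise delegate
    // to the translated expression.
    static <A, B> Function<A, B> assertNotNull(Function<A, B> translated) {
        return input -> {
            if (input == null) {
                throw new NullPointerException("null input to translated closure");
            }
            return translated.apply(input);
        };
    }

    // Helper: reports whether running the body raises a NullPointerException.
    static boolean throwsNpe(Runnable body) {
        try {
            body.run();
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(throwsNpe(() -> equalsFoo(null)));      // closure throws
        System.out.println(sqlEqualsFoo(null));                    // naive translation yields null
        System.out.println(throwsNpe(() -> FAITHFUL.apply(null))); // wrapped translation throws
        System.out.println(FAITHFUL.apply("foo"));                 // non-null input is unchanged
    }
}
```

The naive translation silently turns a failing query into one that returns
null, which is exactly the kind of behavior change described above; the
wrapper restores the exception at the cost of an extra null check per row.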
> Analyze JVM bytecode and turn closures into Catalyst expressions
> ----------------------------------------------------------------
>
> Key: SPARK-14083
> URL: https://issues.apache.org/jira/browse/SPARK-14083
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of
> performance due to heavy reliance on user-defined closures/lambdas. These
> closures are typically slower than expressions because we have more
> flexibility to optimize expressions (known data types, no virtual function
> calls, etc). In many cases, it's actually not going to be very difficult to
> look into the byte code of these closures and figure out what they are trying
> to do. If we can understand them, then we can turn them directly into
> Catalyst expressions for better-optimized execution.
> Some examples are:
> {code}
> df.map(_.name) // equivalent to expression col("name")
> ds.groupBy(_.gender) // equivalent to expression col("gender")
> df.filter(_.age > 18) // equivalent to expression GreaterThan(col("age"), lit(18))
> df.map(_.id + 1) // equivalent to expression Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis
> and use that to convert closures/lambdas into Catalyst expressions in order
> to speed up Dataset execution. It is a little bit futuristic, but I believe
> it is very doable. The framework should be easy to reason about (e.g. similar
> to Catalyst).
> Note the big emphasis on "small" and "easy to reason about". A patch
> should be rejected if it is too complicated or difficult to reason about.