[ https://issues.apache.org/jira/browse/SPARK-14083?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223010#comment-15223010 ]

Josh Rosen edited comment on SPARK-14083 at 4/2/16 7:02 PM:
------------------------------------------------------------

Here's one example of how we might aim to preserve Java/Scala closure API null 
behavior for field accesses:

Consider the following closure:

{code}
val ds = Seq[(String, Integer)](("a", 1), ("b", 2), ("c", 3), (null, null)).toDF()
ds.filter(r => r.getInt(1) == 2).collect()
{code}

This code will fail with a NullPointerException in the {{getInt()}} call (per its 
contract). This closure's bytecode looks like this:

{code}
aload_1
iconst_1
invokeinterface #22 = Method org.apache.spark.sql.Row.getInt((I)I)
iconst_2
if_icmpne 15
iconst_1
goto 16
iconst_0
ireturn
{code}
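
A listing like this can be reproduced by disassembling the compiled closure, for example with a small ASM helper along the lines of the hypothetical sketch below (not part of the prototype; it assumes ASM is on the classpath and that the closure compiles to an ordinary anonymous class whose {{.class}} file is visible to its class loader):

{code}
import java.io.PrintWriter
import org.objectweb.asm.ClassReader
import org.objectweb.asm.util.TraceClassVisitor

// Print the bytecode of a closure's class so patterns such as the
// getInt / if_icmpne sequence above can be inspected.
def dumpClosureBytecode(closure: AnyRef): Unit = {
  val cls = closure.getClass
  val resource = cls.getName.replace('.', '/') + ".class"
  val in = cls.getClassLoader.getResourceAsStream(resource)
  try {
    new ClassReader(in).accept(new TraceClassVisitor(new PrintWriter(System.out)), 0)
  } finally {
    in.close()
  }
}
{code}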

My most recent prototype converts this into

{code}
cast(if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as boolean)
{code}

where {{npeonnull}} is a new non-SQL expression which throws a NullPointerException 
on null input. If we trust our nullability analysis, then we could add a trivial 
optimizer rule that eliminates {{npeonnull}} calls when their children are 
non-nullable.
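
A rough sketch of what that expression and rule might look like (made-up names; interpreted-only, with codegen falling back to {{eval}}; not the prototype's actual code):

{code}
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.CodegenFallback
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule
import org.apache.spark.sql.types.DataType

// Hypothetical shape of npeonnull: pass the child's value through unchanged,
// but throw a NullPointerException when it is null, mirroring getInt()'s contract.
case class NpeOnNull(child: Expression) extends UnaryExpression with CodegenFallback {
  override def dataType: DataType = child.dataType
  override def nullable: Boolean = false  // never returns null; it throws instead
  override def eval(input: InternalRow): Any = {
    val value = child.eval(input)
    if (value == null) throw new NullPointerException("null field access in closure")
    value
  }
}

// The trivial optimizer rule: drop the wrapper whenever the child is
// statically known to be non-nullable.
object EliminateNpeOnNull extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case NpeOnNull(child) if !child.nullable => child
  }
}
{code}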

If a user wanted the SQL filter semantics here (rows with null values are silently 
dropped rather than causing an exception), then they could rewrite their closure to

{code}
ds.filter(r => !r.isNullAt(1) && r.getInt(1) == 2)
{code}

My prototype translates this closure into

{code}
cast(if (isnull(_2#3)) 0 else if (NOT (npeonnull(_2#3) = 2)) 0 else 1 as boolean)
{code}

Again, I think that this could be easily simplified given some new optimizer 
rules:

- We can propagate the negation of the {{if}} condition into the attributes of 
the else branch.
- Therefore, we can conclude that column 2 is not null when analyzing the else 
case and can strip out the {{npeonnull}} check.
- After both optimizations, plus cast pushdown, constant folding, and a rule that 
replaces {{if(condition, trueLiteral, falseLiteral)}} expressions whose conditions 
are non-nullable with the condition itself (a rough sketch of such a rule follows 
this list), I think we could produce exactly the same {{_2#3 = 2}} filter 
expression that the Catalyst expression DSL would have given us.


> Analyze JVM bytecode and turn closures into Catalyst expressions
> ----------------------------------------------------------------
>
>                 Key: SPARK-14083
>                 URL: https://issues.apache.org/jira/browse/SPARK-14083
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> One big advantage of the Dataset API is the type safety, at the cost of 
> performance due to heavy reliance on user-defined closures/lambdas. These 
> closures are typically slower than expressions because we have more 
> flexibility to optimize expressions (known data types, no virtual function 
> calls, etc). In many cases, it's actually not going to be very difficult to 
> look into the byte code of these closures and figure out what they are trying 
> to do. If we can understand them, then we can turn them directly into 
> Catalyst expressions for more optimized execution.
> Some examples are:
> {code}
> df.map(_.name)  // equivalent to expression col("name")
> ds.groupBy(_.gender)  // equivalent to expression col("gender")
> df.filter(_.age > 18)  // equivalent to expression GreaterThan(col("age"), lit(18))
> df.map(_.id + 1)  // equivalent to Add(col("id"), lit(1))
> {code}
> The goal of this ticket is to design a small framework for byte code analysis 
> and use that to convert closures/lambdas into Catalyst expressions in order 
> to speed up Dataset execution. It is a little bit futuristic, but I believe 
> it is very doable. The framework should be easy to reason about (e.g. similar 
> to Catalyst).
> Note the big emphasis on "small" and "easy to reason about". A patch should 
> be rejected if it is too complicated or difficult to reason about.


