[ 
https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14652390#comment-14652390
 ] 

Herman van Hovell commented on SPARK-9357:
------------------------------------------

+1 for removing this.

The {{AlgebraicAggregate}} part of the new UDAF interfaces uses this a lot in 
very performance critical sections. I ran into this doing some benchmarking for 
the SPARK-8641 ticket. After some profiling I found out that a significant 
amount of time is spent in JoinedRow; It causes a major performance regression 
in some cases.

I have been doing some experimentation:
* After a discussion with [~yhuai] I tried removing the branching in the joined 
row. This improved the situation by 10%.
* I created specialized {{BoundReference}}'s which bind directly to the 
{{row1}} or {{row2}} values of the JoinedRow (exposed those through getters). 
This worked wonders. Performance is much better in most cases now. In the end I 
think it is best to explicitly start to support a {{JoinProjection}} which 
takes a left and a right row as input and produces an output row, and change 
{{SparkPlan}} and CG accordingly. I think we'd still need JoinedRow in the 
interpreted case though. 

I can turn the POC I have made into a PR for discussion.

> Remove JoinedRow
> ----------------
>
>                 Key: SPARK-9357
>                 URL: https://issues.apache.org/jira/browse/SPARK-9357
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>            Reporter: Reynold Xin
>
> JoinedRow was introduced to join two rows together, in aggregation (join key 
> and value), joins (left, right), window functions, etc.
> It aims to reduce the amount of data copied, but incurs branches when the row 
> is actually read. Given all the fields will be read almost all the time 
> (otherwise they get pruned out by the optimizer), branch predictor cannot do 
> anything about those branches.
> I think a better way is just to remove this thing, and materializes the row 
> data directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to