[ 
https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704311#comment-14704311
 ] 

Cheng Hao edited comment on SPARK-9357 at 8/20/15 5:29 AM:
-----------------------------------------------------------

JoinedRow does increase the overhead by adding layer of indirection, however, 
it is a trade-off, as copying the 2 non-continue pieces of memory together also 
causes performance overhead, particularly the case I listed above, only a few 
records really need by the downstream operators(writing to files) after the 
filtering.

The n-ary JoinedRow will be really helpful in case like the sequential joins. 
For example:
{code} SELECT* FROM a join b on a.key=b.key join c on a.key=c.key and 
a.col1>b.col1 and b.col2<c.col3 {code}

As we know the join keys are exactly the same, and the join operators compute 
the result in the same stage, and then the intermediate row(like below), will 
be sent to operator Filter for predicating the (a.col1>b.col1 and 
b.col2<c.col3);
{noformat}
          JoinedRow(a,b,c)
              /      \
JoinedRow(a, b)    row(c)
        /   \
row(a)    row(b)
{noformat}
It's probably more helpful if we can codegen a row layer for supporting the 
n-ary joined rows, Definitely save memory copyings, as we never copy even a 
single byte at all.


was (Author: chenghao):
JoinedRow does increase the overhead by adding layer of indirection, however, 
it is a trade-off, as copying the 2 non-continue pieces of memory together is 
also causes performance issue, particularly the case I listed above, only a few 
records really need by the downstream operators(writing to files) after the 
filtering.

The n-ary JoinedRow will be really helpful in case like the sequential joins. 
For example:
{code} SELECT* FROM a join b on a.key=b.key join c on a.key=c.key and 
a.col1>b.col1 and b.col2<c.col3 {code}

As we know the join keys are exactly the same, and the join operators compute 
the result in the same stage, and then the intermediate row(like below), will 
be sent to operator Filter for predicating the (a.col1>b.col1 and 
b.col2<c.col3);
{noformat}
          JoinedRow(a,b,c)
              /      \
JoinedRow(a, b)    row(c)
        /   \
row(a)    row(b)
{noformat}
It's probably more helpful if we can codegen a row layer for supporting the 
n-ary joined rows, Definitely save memory copyings, as we never copy even a 
single byte at all.

> Remove JoinedRow
> ----------------
>
>                 Key: SPARK-9357
>                 URL: https://issues.apache.org/jira/browse/SPARK-9357
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>            Reporter: Reynold Xin
>
> JoinedRow was introduced to join two rows together, in aggregation (join key 
> and value), joins (left, right), window functions, etc.
> It aims to reduce the amount of data copied, but incurs branches when the row 
> is actually read. Given all the fields will be read almost all the time 
> (otherwise they get pruned out by the optimizer), branch predictor cannot do 
> anything about those branches.
> I think a better way is just to remove this thing, and materializes the row 
> data directly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to