[ https://issues.apache.org/jira/browse/SPARK-9357?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704311#comment-14704311 ]
Cheng Hao edited comment on SPARK-9357 at 8/20/15 5:29 AM: ----------------------------------------------------------- JoinedRow does increase the overhead by adding layer of indirection, however, it is a trade-off, as copying the 2 non-continue pieces of memory together also causes performance overhead, particularly the case I listed above, only a few records really need by the downstream operators(writing to files) after the filtering. The n-ary JoinedRow will be really helpful in case like the sequential joins. For example: {code} SELECT* FROM a join b on a.key=b.key join c on a.key=c.key and a.col1>b.col1 and b.col2<c.col3 {code} As we know the join keys are exactly the same, and the join operators compute the result in the same stage, and then the intermediate row(like below), will be sent to operator Filter for predicating the (a.col1>b.col1 and b.col2<c.col3); {noformat} JoinedRow(a,b,c) / \ JoinedRow(a, b) row(c) / \ row(a) row(b) {noformat} It's probably more helpful if we can codegen a row layer for supporting the n-ary joined rows, Definitely save memory copyings, as we never copy even a single byte at all. was (Author: chenghao): JoinedRow does increase the overhead by adding layer of indirection, however, it is a trade-off, as copying the 2 non-continue pieces of memory together is also causes performance issue, particularly the case I listed above, only a few records really need by the downstream operators(writing to files) after the filtering. The n-ary JoinedRow will be really helpful in case like the sequential joins. For example: {code} SELECT* FROM a join b on a.key=b.key join c on a.key=c.key and a.col1>b.col1 and b.col2<c.col3 {code} As we know the join keys are exactly the same, and the join operators compute the result in the same stage, and then the intermediate row(like below), will be sent to operator Filter for predicating the (a.col1>b.col1 and b.col2<c.col3); {noformat} JoinedRow(a,b,c) / \ JoinedRow(a, b) row(c) / \ row(a) row(b) {noformat} It's probably more helpful if we can codegen a row layer for supporting the n-ary joined rows, Definitely save memory copyings, as we never copy even a single byte at all. > Remove JoinedRow > ---------------- > > Key: SPARK-9357 > URL: https://issues.apache.org/jira/browse/SPARK-9357 > Project: Spark > Issue Type: Umbrella > Components: SQL > Reporter: Reynold Xin > > JoinedRow was introduced to join two rows together, in aggregation (join key > and value), joins (left, right), window functions, etc. > It aims to reduce the amount of data copied, but incurs branches when the row > is actually read. Given all the fields will be read almost all the time > (otherwise they get pruned out by the optimizer), branch predictor cannot do > anything about those branches. > I think a better way is just to remove this thing, and materializes the row > data directly. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org