[ https://issues.apache.org/jira/browse/HIVE-8701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218898#comment-14218898 ]
Szehon Ho commented on HIVE-8701:
---------------------------------

Hi Suhas, sure. The plan I see looks like this, for a modified plan of auto_join2 that forces mapjoin to be in the same operator:

{noformat}
STAGE DEPENDENCIES:
  Stage-3 is a root stage
  Stage-1 depends on stages: Stage-3
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-3
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: src2
                  Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 250 Data size: 2656 Basic stats: COMPLETE Column stats: NONE
                    Spark HashTable Sink Operator
                      condition expressions:
                        0 {key}
                        1
                      keys:
                        0 key (type: string)
                        1 key (type: string)
            Local Work:
              Map Reduce Local Work
        Map 3
            Map Operator Tree:
                TableScan
                  alias: smalltable
                  Statistics: Num rows: 0 Data size: 30 Basic stats: PARTIAL Column stats: NONE
                  Filter Operator
                    predicate: UDFToDouble(key) is not null (type: boolean)
                    Statistics: Num rows: 0 Data size: 0 Basic stats: NONE Column stats: NONE
                    Spark HashTable Sink Operator
                      condition expressions:
                        0 {_col0} {_col5}
                        1 {key}
                      keys:
                        0 (_col0 + _col5) (type: double)
                        1 UDFToDouble(key) (type: double)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 2
            Map Operator Tree:
                TableScan
                  alias: src1
                  Statistics: Num rows: 500 Data size: 5312 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: key is not null (type: boolean)
                    Statistics: Num rows: 250 Data size: 2656 Basic stats: COMPLETE Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      condition expressions:
                        0 {key}
                        1 {key}
                      keys:
                        0 key (type: string)
                        1 key (type: string)
                      outputColumnNames: _col0, _col5
                      input vertices:
                        1 Map 1
                      Statistics: Num rows: 275 Data size: 2921 Basic stats: COMPLETE Column stats: NONE
                      Filter Operator
                        predicate: (_col0 + _col5) is not null (type: boolean)
                        Statistics: Num rows: 138 Data size: 1465 Basic stats: COMPLETE Column stats: NONE
                        Map Join Operator
                          condition map:
                               Inner Join 0 to 1
                          condition expressions:
                            0 {_col0} {_col5}
                            1 {key}
                          keys:
                            0 (_col0 + _col5) (type: double)
                            1 UDFToDouble(key) (type: double)
                          outputColumnNames: _col0, _col5, _col10
                          input vertices:
                            1 Map 3
                          Statistics: Num rows: 151 Data size: 1611 Basic stats: COMPLETE Column stats: NONE
                          Select Operator
                            expressions: _col0 (type: string), _col5 (type: string), _col10 (type: string)
                            outputColumnNames: _col0, _col1, _col2
                            Statistics: Num rows: 151 Data size: 1611 Basic stats: COMPLETE Column stats: NONE
                            File Output Operator
                              compressed: false
                              Statistics: Num rows: 151 Data size: 1611 Basic stats: COMPLETE Column stats: NONE
                              table:
                                  input format: org.apache.hadoop.mapred.TextInputFormat
                                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{noformat}

The issue is that there are two mapjoins in the same work, which is actually good most of the time, but we should make sure we don't overwhelm the executor memory in that case. The check should be straightforward in theory: just also include the size of any parent mapjoin that is directly connected (i.e., no RS or HTS in between) in the calculation of table size.

> Combine nested map joins into the parent map join if possible [Spark Branch]
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-8701
>                 URL: https://issues.apache.org/jira/browse/HIVE-8701
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Spark
>            Reporter: Xuefu Zhang
>            Assignee: Szehon Ho
>
> With the work in HIVE-8616 enabled, the generated plan shows that the nested
> map join operator isn't merged to its parent when possible. This is
> demonstrated in auto_join2.q. The MR plan shows that this optimization is in
> place. We should do the same for Spark.
> {code}
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       Edges:
>         Map 2 <- Map 3 (NONE, 0)
>         Map 3 <- Map 1 (NONE, 0)
>       DagName: xzhang_20141102074141_ac089634-bf01-4386-b1cf-3e7f2e99f6eb:3
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: src2
>                   Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: key is not null (type: boolean)
>                     Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
>                     Reduce Output Operator
>                       key expressions: key (type: string)
>                       sort order: +
>                       Map-reduce partition columns: key (type: string)
>                       Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
>         Map 2
>             Map Operator Tree:
>                 TableScan
>                   alias: src3
>                   Statistics: Num rows: 29 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: UDFToDouble(key) is not null (type: boolean)
>                     Statistics: Num rows: 15 Data size: 3006 Basic stats: COMPLETE Column stats: NONE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       condition expressions:
>                         0 {_col0}
>                         1 {value}
>                       keys:
>                         0 (_col0 + _col5) (type: double)
>                         1 UDFToDouble(key) (type: double)
>                       outputColumnNames: _col0, _col11
>                       input vertices:
>                         0 Map 3
>                       Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
>                       Select Operator
>                         expressions: _col0 (type: string), _col11 (type: string)
>                         outputColumnNames: _col0, _col1
>                         Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
>                         File Output Operator
>                           compressed: false
>                           Statistics: Num rows: 17 Data size: 1813 Basic stats: COMPLETE Column stats: NONE
>                           table:
>                               input format: org.apache.hadoop.mapred.TextInputFormat
>                               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Map 3
>             Map Operator Tree:
>                 TableScan
>                   alias: src1
>                   Statistics: Num rows: 58 Data size: 5812 Basic stats: COMPLETE Column stats: NONE
>                   Filter Operator
>                     predicate: key is not null (type: boolean)
>                     Statistics: Num rows: 29 Data size: 2906 Basic stats: COMPLETE Column stats: NONE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       condition expressions:
>                         0 {key}
>                         1 {key}
>                       keys:
>                         0 key (type: string)
>                         1 key (type: string)
>                       outputColumnNames: _col0, _col5
>                       input vertices:
>                         1 Map 1
>                       Statistics: Num rows: 31 Data size: 3196 Basic stats: COMPLETE Column stats: NONE
>                       Filter Operator
>                         predicate: (_col0 + _col5) is not null (type: boolean)
>                         Statistics: Num rows: 16 Data size: 1649 Basic stats: COMPLETE Column stats: NONE
>                         Reduce Output Operator
>                           key expressions: (_col0 + _col5) (type: double)
>                           sort order: +
>                           Map-reduce partition columns: (_col0 + _col5) (type: double)
>                           Statistics: Num rows: 16 Data size: 1649 Basic stats: COMPLETE Column stats: NONE
>                           value expressions: _col0 (type: string)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
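
The memory check proposed in the comment (count the size of any parent mapjoin that is connected without an RS or HTS in between) could be sketched roughly as follows. This is only an illustration under stated assumptions, not Hive code: `MapJoinSizeCheck`, `Op`, `Kind`, `smallTableBytes`, `connectedMapJoinBytes`, and `fitsInMemory` are hypothetical names, and the sketch assumes the operator graph inside a work is a tree and that "directly connected" means reachable upward through any chain of operators that does not cross a ReduceSink or HashTableSink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified operator model; these are NOT Hive's real classes.
final class MapJoinSizeCheck {
    enum Kind { TABLE_SCAN, FILTER, MAP_JOIN, REDUCE_SINK, HASH_TABLE_SINK }

    static final class Op {
        final Kind kind;
        // For a MAP_JOIN: assumed total size of the hash tables it loads; 0 otherwise.
        final long smallTableBytes;
        final List<Op> parents;

        Op(Kind kind, long smallTableBytes, Op... parents) {
            this.kind = kind;
            this.smallTableBytes = smallTableBytes;
            this.parents = new ArrayList<>(Arrays.asList(parents));
        }
    }

    // Sums the hash-table sizes of all map joins reachable upward from 'op'
    // without crossing a ReduceSink (RS) or HashTableSink (HTS), i.e. the
    // map joins that would run in the same work and share executor memory.
    static long connectedMapJoinBytes(Op op) {
        long total = 0;
        for (Op parent : op.parents) {
            if (parent.kind == Kind.REDUCE_SINK || parent.kind == Kind.HASH_TABLE_SINK) {
                continue; // work boundary: the parent runs in a different work
            }
            if (parent.kind == Kind.MAP_JOIN) {
                total += parent.smallTableBytes;
            }
            total += connectedMapJoinBytes(parent);
        }
        return total;
    }

    // A map join passes the check only if its own hash tables plus those of
    // connected parent map joins stay within the threshold.
    static boolean fitsInMemory(Op mapJoin, long thresholdBytes) {
        return mapJoin.smallTableBytes + connectedMapJoinBytes(mapJoin) <= thresholdBytes;
    }
}
```

For a chain shaped like Map 2 in the first plan (TableScan, Filter, Map Join, Filter, Map Join), `connectedMapJoinBytes` called on the second map join returns the first map join's hash-table size, so `fitsInMemory` compares the combined size against the threshold instead of each small table in isolation.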