[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013156#comment-14013156 ]
Navis commented on HIVE-4867:
-----------------------------

Yes, there is a problem with mapjoin on Tez. The MR compiler replaces the RS with a HashSink built from the value exprs of the Join, but the Tez compiler uses the RS as-is, assuming it has the same columns as the value exprs of the Join, which is not true with this patch. I need some more time to fix it.

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, source_only.txt
>
>
> A ReduceSinkOperator emits data in the format of keys and values. Right now,
> a column may appear in both the key list and value list, which results in
> unnecessary overhead for shuffling.
> Example:
> We have a query shown below ...
> {code:sql}
> explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
> {code}
> The plan is ...
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
>
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         store_sales
>           TableScan
>             alias: store_sales
>             Select Operator
>               expressions:
>                     expr: ss_ticket_number
>                     type: int
>               outputColumnNames: _col0
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
>                 value expressions:
>                       expr: _col0
>                       type: int
>       Reduce Operator Tree:
>         Extract
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The column 'ss_ticket_number' is in both the key list and value list of the
> ReduceSinkOperator. The type of ss_ticket_number is int. For this case,
> BinarySortableSerDe will introduce 1 byte more for every int in the key.
> LazyBinarySerDe will also introduce overhead when recording the length of an
> int. For every int, 10 bytes should be a rough estimate of the size of the data
> emitted from the Map phase.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
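
For illustration only, the deduplication proposed in this issue can be sketched as below. This is a minimal Python sketch under stated assumptions, not Hive's implementation and not the attached patch: the function names `dedup_reduce_sink` and `rebuild_row` and the `("KEY", i)` / `("VALUE", i)` tags are hypothetical. It drops value columns that already appear in the key list and records where each original value column can be read from on the reduce side.

```python
def dedup_reduce_sink(key_cols, value_cols):
    """Drop value columns that duplicate a key column.

    Returns (kept_values, sources). sources has one entry per ORIGINAL
    value column: ("KEY", i) if it is now read from key position i, or
    ("VALUE", j) if it is read from position j of the reduced value list.
    Hypothetical helper, for illustration only.
    """
    # First position of each column in the key list.
    key_index = {}
    for i, col in enumerate(key_cols):
        key_index.setdefault(col, i)

    kept_values = []
    sources = []
    for col in value_cols:
        if col in key_index:
            # Already shuffled as part of the key: do not emit it again.
            sources.append(("KEY", key_index[col]))
        else:
            sources.append(("VALUE", len(kept_values)))
            kept_values.append(col)
    return kept_values, sources


def rebuild_row(key_row, value_row, sources):
    """Reassemble the original value row from the shuffled key and value rows."""
    return [key_row[i] if side == "KEY" else value_row[i] for side, i in sources]


# The plan quoted above: _col0 is both the only key and the only value.
kept, sources = dedup_reduce_sink(["_col0"], ["_col0"])
print(kept, sources)
print(rebuild_row([42], [], sources))
```

Any consumer of the ReduceSink output would have to go through a mapping like `sources` rather than assume the value list still matches the original value expressions, which is the assumption the comment above describes on the Tez mapjoin path.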