[ https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001416#comment-14001416 ]
Navis commented on HIVE-4867:
-----------------------------

I think the patch is almost ready, but the diff file cannot be attached here (it is bigger than 10MB). Most of the change comes from removing duplicated lineage information, so I'm thinking of fixing that first.

> Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, source_only.txt
>
>
> A ReduceSinkOperator emits data in the format of keys and values. Right now, a column may appear in both the key list and the value list, which results in unnecessary overhead for shuffling.
> Example:
> We have a query shown below ...
> {code:sql}
> explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
> {code}
> The plan is ...
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         store_sales
>           TableScan
>             alias: store_sales
>             Select Operator
>               expressions:
>                     expr: ss_ticket_number
>                     type: int
>               outputColumnNames: _col0
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
>                 value expressions:
>                       expr: _col0
>                       type: int
>       Reduce Operator Tree:
>         Extract
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The column 'ss_ticket_number' is in both the key list and the value list of the ReduceSinkOperator. The type of ss_ticket_number is int. For this case, BinarySortableSerDe will introduce 1 byte more for every int in the key. LazyBinarySerDe will also introduce overhead when recording the length of an int. For every int, 10 bytes should be a rough estimate of the size of the data emitted from the Map phase.
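
To make the estimate in the description concrete, here is a minimal Java sketch. It is not taken from the HIVE-4867 patch. The class `ReduceSinkDedupSketch` and its methods are hypothetical names used only for illustration. The 5-byte key size follows the description (a 4-byte int plus 1 byte from BinarySortableSerDe). The 5-byte value size is an assumption, chosen so that key plus value matches the rough 10-byte figure. The 1,000,000 row count is made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; this is not code from the HIVE-4867 patch.
final class ReduceSinkDedupSketch {

    // From the description: a 4-byte int plus 1 extra byte in the key.
    static final int KEY_BYTES_PER_INT = 5;
    // Assumption: picked so that key + value gives the rough 10-byte estimate.
    static final int VALUE_BYTES_PER_INT = 5;

    /** Returns the value columns that are not already in the key list, keeping their order. */
    static List<String> dedupValues(List<String> keyCols, List<String> valueCols) {
        Set<String> keys = new HashSet<>(keyCols);
        List<String> kept = new ArrayList<>();
        for (String col : valueCols) {
            if (!keys.contains(col)) {
                kept.add(col);
            }
        }
        return kept;
    }

    /** Rough shuffle size for rows that carry only int columns. */
    static long shuffleBytes(long rows, int keyInts, int valueInts) {
        long perRow = (long) keyInts * KEY_BYTES_PER_INT
                    + (long) valueInts * VALUE_BYTES_PER_INT;
        return rows * perRow;
    }

    public static void main(String[] args) {
        // Column lists from the plan above: _col0 is both a key and a value.
        List<String> keys = Collections.singletonList("_col0");
        List<String> values = Collections.singletonList("_col0");
        List<String> deduped = dedupValues(keys, values);

        long rows = 1_000_000L; // hypothetical row count
        long before = shuffleBytes(rows, keys.size(), values.size());
        long after = shuffleBytes(rows, keys.size(), deduped.size());

        System.out.println("value columns after dedup: " + deduped); // []
        System.out.println("bytes before: " + before);               // 10000000
        System.out.println("bytes after:  " + after);                // 5000000
    }
}
```

Under these assumptions, dropping the duplicated value column roughly halves the data emitted from the Map phase for this query. The real saving depends on how LazyBinarySerDe encodes each value.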