[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

Navis (JIRA) Thu, 29 May 2014 17:41:24 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-4867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14013156#comment-14013156
 ]


Navis commented on HIVE-4867:
-----------------------------

Yes, there is a problem in mapjoin on tez. MR compiler replaces RS with 
HashSink made from value exprs of Join but Tez compiler uses RS as is state 
assuming it has same columns with value exprs of Join, which is not true with 
this patch. Need some more time to fix it.

> Deduplicate columns appearing in both the key list and value list of 
> ReduceSinkOperator
> ---------------------------------------------------------------------------------------
>
>                 Key: HIVE-4867
>                 URL: https://issues.apache.org/jira/browse/HIVE-4867
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Navis
>         Attachments: HIVE-4867.1.patch.txt, HIVE-4867.2.patch.txt, 
> source_only.txt
>
>
> A ReduceSinkOperator emits data in the format of keys and values. Right now, 
> a column may appear in both the key list and value list, which result in 
> unnecessary overhead for shuffling. 
> Example:
> We have a query shown below ...
> {code:sql}
> explain select ss_ticket_number from store_sales cluster by ss_ticket_number;
> {\code}
> The plan is ...
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         store_sales 
>           TableScan
>             alias: store_sales
>             Select Operator
>               expressions:
>                     expr: ss_ticket_number
>                     type: int
>               outputColumnNames: _col0
>               Reduce Output Operator
>                 key expressions:
>                       expr: _col0
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: _col0
>                       type: int
>                 tag: -1
>                 value expressions:
>                       expr: _col0
>                       type: int
>       Reduce Operator Tree:
>         Extract
>           File Output Operator
>             compressed: false
>             GlobalTableId: 0
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {\code}
> The column 'ss_ticket_number' is in both the key list and value list of the 
> ReduceSinkOperator. The type of ss_ticket_number is int. For this case, 
> BinarySortableSerDe will introduce 1 byte more for every int in the key. 
> LazyBinarySerDe will also introduce overhead when recording the length of a 
> int. For every int, 10 bytes should be a rough estimation of the size of data 
> emitted from the Map phase. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Commented] (HIVE-4867) Deduplicate columns appearing in both the key list and value list of ReduceSinkOperator

Reply via email to