[jira] [Commented] (HIVE-4809) ReduceSinkOperator of PTFOperator can have redundant key columns

Yin Huai (JIRA) Wed, 03 Jul 2013 11:35:34 -0700

    [ https://issues.apache.org/jira/browse/HIVE-4809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13699294#comment-13699294 ]


Yin Huai commented on HIVE-4809:
--------------------------------

For an OVER clause, we can have partitioning columns (specified by PARTITION BY) 
and ordering columns (specified by ORDER BY). In the current implementation, the 
key columns of the ReduceSinkOperator (RS) handle both grouping (for the 
partitioning columns) and ordering (for the ordering columns). So we first add 
all partitioning columns and then all ordering columns to the key columns of the 
RS. If no ordering columns are specified, we use the partitioning columns as the 
ordering columns, which is why "a" appears twice in the plan. It seems we cannot 
completely remove duplicate key columns right now, because the key columns of 
the RS have to take care of both grouping and ordering. But we can optimize 
certain cases. For example, if ordering columns are not specified, we do not 
need to assign the partitioning columns as ordering columns.
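
To make the key construction concrete, here is a minimal sketch in Python. It 
is not Hive's actual code: the function name build_rs_keys, the skip_fallback 
flag, and the assumption that every key column sorts ascending ("+") are all 
hypothetical and used only for illustration.

```python
from typing import List, Optional, Tuple


def build_rs_keys(
    partition_cols: List[str],
    order_cols: Optional[List[str]] = None,
    skip_fallback: bool = False,
) -> Tuple[List[str], str]:
    """Return (key columns, sort order string) for a hypothetical RS.

    Assumption: every key column sorts ascending, so the sort order is
    one "+" per key column.
    """
    if order_cols:
        # ORDER BY was given: ordering columns follow the partitioning columns.
        ordering = list(order_cols)
    elif skip_fallback:
        # Proposed optimization: no ORDER BY, so add no ordering columns.
        ordering = []
    else:
        # Behavior described above: partitioning columns are reused as
        # ordering columns, which duplicates them in the key.
        ordering = list(partition_cols)

    keys = list(partition_cols) + ordering
    return keys, "+" * len(keys)


if __name__ == "__main__":
    print(build_rs_keys(["a"]))
    print(build_rs_keys(["a"], skip_fallback=True))
```

With PARTITION BY a and no ORDER BY, the first call gives the duplicated key 
from the plan in this issue, and the second call gives the single key column 
the optimization aims for.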
                
> ReduceSinkOperator of PTFOperator can have redundant key columns
> ----------------------------------------------------------------
>
>                 Key: HIVE-4809
>                 URL: https://issues.apache.org/jira/browse/HIVE-4809
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Yin Huai
>            Assignee: Yin Huai
>
> For example, we have a simple query like this ...
> {code:sql}
> SELECT x.a, x.b, count(x.b) OVER (PARTITION BY x.a) FROM src x;
> {code}
> The plan of it is ...
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         x 
>           TableScan
>             alias: x
>             Reduce Output Operator
>               key expressions:
>                     expr: a
>                     type: int
>                     expr: a
>                     type: int
>               sort order: ++
>               Map-reduce partition columns:
>                     expr: a
>                     type: int
>               tag: -1
>               value expressions:
>                     expr: a
>                     type: int
>                     expr: b
>                     type: string
>       Reduce Operator Tree:
>         Extract
>           PTF Operator
>             Select Operator
>               expressions:
>                     expr: _col0
>                     type: int
>                     expr: _col1
>                     type: string
>                     expr: _wcol0
>                     type: bigint
>               outputColumnNames: _col0, _col1, _col2
>               File Output Operator
>                 compressed: false
>                 GlobalTableId: 0
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {code}
> The ReduceSinkOperator lists "a" twice in its key columns. This redundancy 
> can increase the size of the map output.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
