[jira] [Commented] (HIVE-7156) Group-By operator stat-annotation only uses distinct approx to generate rollups

Gopal V (JIRA) Fri, 26 Sep 2014 11:45:08 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149785#comment-14149785
 ]


Gopal V commented on HIVE-7156:
-------------------------------

LGTM - +1, tests pending.

36910400 = (4833087637230 / (256*1024*1024.0)) * 1955

{code}
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: lineitem
            Statistics: Num rows: 5999989709 Data size: 4833087637230 Basic 
stats: COMPLETE Column stats: COMPLETE
            Select Operator
              expressions: l_shipdate (type: string)
              outputColumnNames: l_shipdate
              Statistics: Num rows: 5999989709 Data size: 4833087637230 Basic 
stats: COMPLETE Column stats: COMPLETE
              Group By Operator
                keys: l_shipdate (type: string)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 36910400 Data size: 3469577600 Basic 
stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 36910400 Data size: 3469577600 Basic 
stats: COMPLETE Column stats: COMPLETE
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE 
Column stats: COMPLETE
          Select Operator
            expressions: _col0 (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE 
Column stats: COMPLETE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1955 Data size: 183770 Basic stats: 
COMPLETE Column stats: COMPLETE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
{code}

> Group-By operator stat-annotation only uses distinct approx to generate 
> rollups
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-7156
>                 URL: https://issues.apache.org/jira/browse/HIVE-7156
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 0.14.0
>            Reporter: Gopal V
>            Assignee: Prasanth J
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: HIVE-7156.1.patch, HIVE-7156.2.patch, HIVE-7156.3.patch, 
> HIVE-7156.4.patch, HIVE-7156.5.patch, HIVE-7156.6.patch, HIVE-7156.7.patch, 
> HIVE-7156.8.patch, HIVE-7156.8.patch, hive-debug.log.bz2
>
>
> The stats annotation for a group-by only annotates the reduce-side row-count 
> with the distinct values.
> The map-side gets the row-count as the rows output instead of distinct * 
> parallelism, while the reducer side gets the correct parallelism.
> {code}
> hive> explain select distinct L_SHIPDATE from lineitem;
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: lineitem
>                   Statistics: Num rows: 5999989709 Data size: 4745677733354 
> Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     expressions: l_shipdate (type: string)
>                     outputColumnNames: l_shipdate
>                     Statistics: Num rows: 5999989709 Data size: 4745677733354 
> Basic stats: COMPLETE Column stats: COMPLETE
>                     Group By Operator
>                       keys: l_shipdate (type: string)
>                       mode: hash
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 5999989709 Data size: 
> 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: string)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: string)
>                         Statistics: Num rows: 5999989709 Data size: 
> 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Reducer 2 
>             Reduce Operator Tree:
>               Group By Operator
>                 keys: KEY._col0 (type: string)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1955 Data size: 183770 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                 Select Operator
>                   expressions: _col0 (type: string)
>                   outputColumnNames: _col0
>                   Statistics: Num rows: 1955 Data size: 183770 Basic stats: 
> COMPLETE Column stats: COMPLETE
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (HIVE-7156) Group-By operator stat-annotation only uses distinct approx to generate rollups

Reply via email to