[ https://issues.apache.org/jira/browse/HIVE-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14149785#comment-14149785 ]
Gopal V commented on HIVE-7156:
-------------------------------

LGTM - +1, tests pending.

36910400 = (4833087637230 / (256*1024*1024.0)) * 1955

{code}
STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: lineitem
            Statistics: Num rows: 5999989709 Data size: 4833087637230 Basic stats: COMPLETE Column stats: COMPLETE
            Select Operator
              expressions: l_shipdate (type: string)
              outputColumnNames: l_shipdate
              Statistics: Num rows: 5999989709 Data size: 4833087637230 Basic stats: COMPLETE Column stats: COMPLETE
              Group By Operator
                keys: l_shipdate (type: string)
                mode: hash
                outputColumnNames: _col0
                Statistics: Num rows: 36910400 Data size: 3469577600 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  key expressions: _col0 (type: string)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: string)
                  Statistics: Num rows: 36910400 Data size: 3469577600 Basic stats: COMPLETE Column stats: COMPLETE
      Execution mode: vectorized
      Reduce Operator Tree:
        Group By Operator
          keys: KEY._col0 (type: string)
          mode: mergepartial
          outputColumnNames: _col0
          Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
          Select Operator
            expressions: _col0 (type: string)
            outputColumnNames: _col0
            Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
            File Output Operator
              compressed: false
              Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
{code}

> Group-By operator stat-annotation only uses distinct approx to generate rollups
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-7156
>                 URL: https://issues.apache.org/jira/browse/HIVE-7156
>             Project: Hive
>          Issue Type: Sub-task
>    Affects Versions: 0.14.0
>            Reporter: Gopal V
>            Assignee: Prasanth J
>            Priority: Blocker
>             Fix For: 0.14.0
>
>         Attachments: HIVE-7156.1.patch, HIVE-7156.2.patch, HIVE-7156.3.patch, HIVE-7156.4.patch, HIVE-7156.5.patch, HIVE-7156.6.patch, HIVE-7156.7.patch, HIVE-7156.8.patch, HIVE-7156.8.patch, hive-debug.log.bz2
>
>
> The stats annotation for a group-by only annotates the reduce-side row-count with the distinct values.
> The map-side gets the row-count as the rows output instead of distinct * parallelism, while the reducer side gets the correct parallelism.
> {code}
> hive> explain select distinct L_SHIPDATE from lineitem;
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: lineitem
>                   Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     expressions: l_shipdate (type: string)
>                     outputColumnNames: l_shipdate
>                     Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE
>                     Group By Operator
>                       keys: l_shipdate (type: string)
>                       mode: hash
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: string)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: string)
>                         Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Reducer 2
>             Reduce Operator Tree:
>               Group By Operator
>                 keys: KEY._col0 (type: string)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
>                 Select Operator
>                   expressions: _col0 (type: string)
>                   outputColumnNames: _col0
>                   Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
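
For anyone re-checking the "distinct * parallelism" arithmetic from the comment, here is a minimal Python sketch. It is not Hive's implementation: the function name, the ceiling on the task count, and the cap at the input row count are assumptions made for illustration. Only the row count, data size, and distinct count come from the plan above. Under these assumptions, a decimal 256,000,000-byte divisor reproduces the 36910400 shown in the plan exactly, while the binary 256*1024*1024 divisor written in the comment gives roughly 35.2 million, so the formula in the comment is best read as approximate.

```python
def map_side_group_by_rows(input_rows, data_size_bytes, distinct_keys, bytes_per_task):
    """Estimate rows emitted by a map-side hash group-by (illustrative only).

    Assumption: each parallel task emits at most one row per distinct key,
    and the total can never exceed the number of input rows.
    """
    # Integer ceiling division avoids floating-point rounding on large sizes.
    parallelism = -(-data_size_bytes // bytes_per_task)
    return min(input_rows, parallelism * distinct_keys)


# Figures taken from the TableScan and reduce-side statistics in the plan.
rows = 5999989709
size = 4833087637230
ndv = 1955

# Decimal 256 MB per task (assumed): matches the plan's map-side estimate.
print(map_side_group_by_rows(rows, size, ndv, 256 * 1000 * 1000))  # 36910400

# Binary 256 MiB per task, as written in the comment: slightly lower.
print(map_side_group_by_rows(rows, size, ndv, 256 * 1024 * 1024))  # 35199775
```

Either way the map-side estimate drops from about 6 billion rows to tens of millions, which is the behaviour the issue asks for.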