[ https://issues.apache.org/jira/browse/HIVE-7156?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14155554#comment-14155554 ]
Gopal V commented on HIVE-7156:
-------------------------------

bq. My point is, it's probably better if we have clean code path if anything is related to execution engine, but this method doesn't seem resembling anything like that.

Agreed. The CBO rules should be the only ones estimating data sizes moving between operators.

In the process of moving more optimizations into CBO cost-based rules, we'll deprecate these rules entirely.

> Group-By operator stat-annotation only uses distinct approx to generate rollups
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-7156
>                 URL: https://issues.apache.org/jira/browse/HIVE-7156
>             Project: Hive
>          Issue Type: Sub-task
>   Affects Versions: 0.14.0
>            Reporter: Gopal V
>            Assignee: Prasanth J
>            Priority: Blocker
>              Labels: TODOC14
>             Fix For: 0.14.0
>
>         Attachments: HIVE-7156.1.patch, HIVE-7156.2.patch, HIVE-7156.3.patch,
> HIVE-7156.4.patch, HIVE-7156.5.patch, HIVE-7156.6.patch, HIVE-7156.7.patch,
> HIVE-7156.8.patch, HIVE-7156.8.patch, HIVE-7156.9.patch, hive-debug.log.bz2
>
>
> The stats annotation for a group-by only annotates the reduce-side row-count
> with the distinct values.
> The map-side gets the row-count as the rows output instead of distinct *
> parallelism, while the reducer side gets the correct parallelism.
> {code}
> hive> explain select distinct L_SHIPDATE from lineitem;
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: lineitem
>                   Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE
>                   Select Operator
>                     expressions: l_shipdate (type: string)
>                     outputColumnNames: l_shipdate
>                     Statistics: Num rows: 5999989709 Data size: 4745677733354 Basic stats: COMPLETE Column stats: COMPLETE
>                     Group By Operator
>                       keys: l_shipdate (type: string)
>                       mode: hash
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: string)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: string)
>                         Statistics: Num rows: 5999989709 Data size: 563999032646 Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized
>         Reducer 2
>             Reduce Operator Tree:
>               Group By Operator
>                 keys: KEY._col0 (type: string)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
>                 Select Operator
>                   expressions: _col0 (type: string)
>                   outputColumnNames: _col0
>                   Statistics: Num rows: 1955 Data size: 183770 Basic stats: COMPLETE Column stats: COMPLETE
> {code}
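
To make the "distinct * parallelism" estimate from the description concrete, here is a minimal Python sketch of the arithmetic. It is not Hive's stats-annotation code: the function names and the `parallelism` value are hypothetical, and it assumes each map task's hash table holds all of its keys without flushing early. Only the row count (5999989709) and the distinct count (1955) come from the plan above.

```python
def estimate_map_side_groupby_rows(input_rows: int, ndv: int, parallelism: int) -> int:
    """Upper bound on rows emitted by a map-side hash group-by.

    Each of the `parallelism` map tasks can emit at most `ndv` distinct keys,
    and the total can never exceed the number of input rows.
    """
    if input_rows < 0 or ndv < 0 or parallelism < 1:
        raise ValueError("input_rows and ndv must be >= 0, parallelism >= 1")
    return min(input_rows, ndv * parallelism)


def estimate_reduce_side_groupby_rows(input_rows: int, ndv: int) -> int:
    """After the merge, at most one row per distinct key remains."""
    if input_rows < 0 or ndv < 0:
        raise ValueError("input_rows and ndv must be >= 0")
    return min(input_rows, ndv)


if __name__ == "__main__":
    input_rows = 5_999_989_709  # TableScan "Num rows" in the plan above
    ndv = 1_955                 # reduce-side "Num rows" (distinct l_shipdate values)
    parallelism = 1_000         # hypothetical number of map tasks, not from the plan

    print(estimate_map_side_groupby_rows(input_rows, ndv, parallelism))
    print(estimate_reduce_side_groupby_rows(input_rows, ndv))
```

With the hypothetical 1,000 map tasks, the map-side estimate falls from the 5999989709 rows shown in the plan to 1955 * 1000 rows, while the reduce-side estimate stays at 1955. The `min` clamp covers the case where distinct * parallelism would exceed the input row count.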