[ https://issues.apache.org/jira/browse/HIVE-5357?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yin Huai updated HIVE-5357:
---------------------------
    Description:
Example:
{code}
select key, count(distinct value) from (select key, value from src group by key, value) t group by key;
//result
0 0 NULL
10 10 NULL
100 100 NULL
103 103 NULL
104 104 NULL
{code}
Obviously the result is wrong.

When we have a simple group by query with a distinct column
{code}
explain select count(distinct value) from src group by key;
{code}
The plan is
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src
          TableScan
            alias: src
            Select Operator
              expressions:
                expr: key
                type: string
                expr: value
                type: string
              outputColumnNames: key, value
              Group By Operator
                aggregations:
                  expr: count(DISTINCT value)
                bucketGroup: false
                keys:
                  expr: key
                  type: string
                  expr: value
                  type: string
                mode: hash
                outputColumnNames: _col0, _col1, _col2
                Reduce Output Operator
                  key expressions:
                    expr: _col0
                    type: string
                    expr: _col1
                    type: string
                  sort order: ++
                  Map-reduce partition columns:
                    expr: _col0
                    type: string
                  tag: -1
                  value expressions:
                    expr: _col2
                    type: bigint
      Reduce Operator Tree:
        Group By Operator
          aggregations:
            expr: count(DISTINCT KEY._col1:0._col0)
          bucketGroup: false
          keys:
            expr: KEY._col0
            type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
              expr: _col1
              type: bigint
            outputColumnNames: _col0
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
{code}
The map side GBY also adds the distinct columns (value in this case) to its key columns.

When RSDedup optimizes a query involving a GBY with distinct keys, if map-side aggregation is enabled, currently it assigns the map-side GBY's key columns to the reduce-side GBY.
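The effect of handing the map-side key columns to the reduce-side GBY can be illustrated with a small simulation. This is only a sketch, not Hive code: the sample rows and the helper `count_distinct` are hypothetical, and it does not reproduce the exact NULL output reported in the description. It only shows that grouping on (key, value) instead of key changes the groups that count(distinct value) is computed over.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for the src table: (key, value).
rows = [
    ("0", "val_0"),
    ("0", "val_0"),
    ("10", "val_10"),
    ("100", "val_100"),
    ("100", "val_100b"),
]


def count_distinct(rows, group_cols):
    """Group rows by the given column indexes and count distinct
    values (column index 1) within each group."""
    groups = defaultdict(set)
    for row in rows:
        group_key = tuple(row[i] for i in group_cols)
        groups[group_key].add(row[1])
    return {k: len(v) for k, v in sorted(groups.items())}


# Intended reduce-side grouping: key only.
correct = count_distinct(rows, group_cols=(0,))

# Grouping as if the map-side key columns (key, value) had been
# assigned to the reduce-side GBY.
wrong = count_distinct(rows, group_cols=(0, 1))

print("group by key:       ", correct)
print("group by key, value:", wrong)
```

With the wider key, every group holds exactly one distinct value, so the per-key distinct count is lost.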
So, for the example shown at the beginning, after we generate a plan with a single MR job, the second GBY in the reduce-side uses both key and value as its key columns. The correct key column is key.

  was:
{code}
select key, count(distinct value) from (select key, value from src group by key, value) t group by key;
//result
0 0 NULL
10 10 NULL
100 100 NULL
103 103 NULL
104 104 NULL
{code}
Obviously the result is wrong.

> ReduceSinkDeDuplication optimizer pick the wrong keys in pRS-cGBYm-cRS-cGBYr scenario when there are distinct keys in child GBY
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-5357
>                 URL: https://issues.apache.org/jira/browse/HIVE-5357
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>   Affects Versions: 0.11.0
>            Reporter: Chun Chen
>            Assignee: Chun Chen
>            Priority: Blocker
>             Fix For: 0.12.0
>
>         Attachments: HIVE-5357.patch