[
https://issues.apache.org/jira/browse/HIVE-6099?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13873093#comment-13873093
]
Navis commented on HIVE-6099:
-----------------------------
sure
{noformat}
hive> explain select a.dt,count(distinct a.user) as AllDist,count(distinct case
when a.type = 111 then a.user else null end) as Type111User,
> count(distinct case when a.type != 111 then a.user else null end) as
Type123User from Table_A a
> group by a.dt
> ;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME Table_A) a)) (TOK_INSERT
(TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (.
(TOK_TABLE_OR_COL a) dt)) (TOK_SELEXPR (TOK_FUNCTIONDI count (.
(TOK_TABLE_OR_COL a) user)) AllDist) (TOK_SELEXPR (TOK_FUNCTIONDI count
(TOK_FUNCTION when (= (. (TOK_TABLE_OR_COL a) type) 111) (. (TOK_TABLE_OR_COL
a) user) TOK_NULL)) Type111User) (TOK_SELEXPR (TOK_FUNCTIONDI count
(TOK_FUNCTION when (!= (. (TOK_TABLE_OR_COL a) type) 111) (. (TOK_TABLE_OR_COL
a) user) TOK_NULL)) Type123User)) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) dt))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Select Operator
expressions:
expr: dt
type: string
expr: user
type: string
expr: type
type: int
outputColumnNames: dt, user, type
Group By Operator
aggregations:
expr: count(DISTINCT user)
expr: count(DISTINCT CASE WHEN ((type = 111)) THEN (user)
ELSE (null) END)
expr: count(DISTINCT CASE WHEN ((type <> 111)) THEN
(user) ELSE (null) END)
bucketGroup: false
keys:
expr: dt
type: string
expr: user
type: string
expr: CASE WHEN ((type = 111)) THEN (user) ELSE (null) END
type: string
expr: CASE WHEN ((type <> 111)) THEN (user) ELSE (null)
END
type: string
mode: hash
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
_col6
Reduce Output Operator
key expressions:
expr: _col0
type: string
expr: _col1
type: string
expr: _col2
type: string
expr: _col3
type: string
sort order: ++++
Map-reduce partition columns:
expr: _col0
type: string
tag: -1
value expressions:
expr: _col4
type: bigint
expr: _col5
type: bigint
expr: _col6
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(DISTINCT KEY._col1:0._col0)
expr: count(DISTINCT KEY._col1:1._col0)
expr: count(DISTINCT KEY._col1:2._col0)
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: mergepartial
outputColumnNames: _col0, _col1, _col2, _col3
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
expr: _col2
type: bigint
expr: _col3
type: bigint
outputColumnNames: _col0, _col1, _col2, _col3
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
{noformat}
> Multi insert does not work properly with distinct count
> -------------------------------------------------------
>
> Key: HIVE-6099
> URL: https://issues.apache.org/jira/browse/HIVE-6099
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Affects Versions: 0.9.0, 0.10.0
> Reporter: Pavan Gadam Manohar
> Labels: count, distinct, insert, multi-insert
>
> Need 2 rows to reproduce this Bug. Here are the steps.
> Step 1) Create a table Table_A
> CREATE EXTERNAL TABLE Table_A
> (
> user string
> , type int
> )
> PARTITIONED BY (dt string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> STORED AS RCFILE
> LOCATION '/hive/<path>/Table_A';
> Step 2) Scenario: Lets us say consider user tommy belong to both usertypes
> 111 and 123. Insert 2 records into the table created above.
> select * from Table_A;
> hive> select * from table_a;
> OK
> tommy 123 2013-12-02
> tommy 111 2013-12-02
> Step 3) Create 2 destination tables to simulate multi-insert.
> CREATE EXTERNAL TABLE dest_Table_A
> (
> p_date string
> , Distinct_Users int
> , Type111Users int
> , Type123Users int
> )
> PARTITIONED BY (dt string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> STORED AS RCFILE
> LOCATION '/hive/<path>/dest_Table_A';
>
> CREATE EXTERNAL TABLE dest_Table_B
> (
> p_date string
> , Distinct_Users int
> , Type111Users int
> , Type123Users int
> )
> PARTITIONED BY (dt string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> STORED AS RCFILE
> LOCATION '/hive/<path>/dest_Table_B';
> Step 4) Multi insert statement
> from Table_A a
> INSERT OVERWRITE TABLE dest_Table_A PARTITION(dt='2013-12-02')
> select a.dt
> ,count(distinct a.user) as AllDist
> ,count(distinct case when a.type = 111 then a.user else null end) as
> Type111User
> ,count(distinct case when a.type != 111 then a.user else null end) as
> Type123User
> group by a.dt
>
> INSERT OVERWRITE TABLE dest_Table_B PARTITION(dt='2013-12-02')
> select a.dt
> ,count(distinct a.user) as AllDist
> ,count(distinct case when a.type = 111 then a.user else null end) as
> Type111User
> ,count(distinct case when a.type != 111 then a.user else null end) as
> Type123User
> group by a.dt
> ;
>
> Step 5) Verify results.
> hive> select * from dest_table_a;
> OK
> 2013-12-02 2 1 1 2013-12-02
> Time taken: 0.116 seconds
> hive> select * from dest_table_b;
> OK
> 2013-12-02 2 1 1 2013-12-02
> Time taken: 0.13 seconds
> Conclusion: Hive gives a count of 2 for distinct users although there is
> only one distinct user. After trying many datasets observed that Hive is
> doing Type111Users + Typoe123Users = DistinctUsers which is wrong.
> hive> select count(distinct a.user) from table_a a;
> Gives:
> Total MapReduce CPU Time Spent: 4 seconds 350 msec
> OK
> 1
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)