zhoukang created SPARK-24440:
--------------------------------

             Summary: When use constant as column we may get wrong answer 
versus impala
                 Key: SPARK-24440
                 URL: https://issues.apache.org/jira/browse/SPARK-24440
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.3.0, 2.1.0
            Reporter: zhoukang


For query below:

{code:java}
select `date`, 100 as platform, count(distinct deviceid) as new_user from 
tv.clean_new_user where `date`=20180528 group by `date`, platform
{code}
We intended to group by 100 and get distinct deviceid number.
By spark sql,we get:
{code}
+-----------+-----------+-----------+--+
|   date    | platform  | new_user  |
+-----------+-----------+-----------+--+
| 20180528  | 100       | 521       |
| 20180528  | 100       | 82        |
| 20180528  | 100       | 3         |
| 20180528  | 100       | 2         |
| 20180528  | 100       | 7         |
| 20180528  | 100       | 870       |
| 20180528  | 100       | 3         |
| 20180528  | 100       | 8         |
| 20180528  | 100       | 3         |
| 20180528  | 100       | 2204      |
| 20180528  | 100       | 1123      |
| 20180528  | 100       | 1         |
| 20180528  | 100       | 54        |
| 20180528  | 100       | 440       |
| 20180528  | 100       | 4         |
| 20180528  | 100       | 478       |
| 20180528  | 100       | 34        |
| 20180528  | 100       | 195       |
| 20180528  | 100       | 17        |
| 20180528  | 100       | 18        |
| 20180528  | 100       | 2         |
| 20180528  | 100       | 2         |
| 20180528  | 100       | 84        |
| 20180528  | 100       | 1616      |
| 20180528  | 100       | 15        |
| 20180528  | 100       | 7         |
| 20180528  | 100       | 479       |
| 20180528  | 100       | 50        |
| 20180528  | 100       | 376       |
| 20180528  | 100       | 21        |
| 20180528  | 100       | 842       |
| 20180528  | 100       | 444       |
| 20180528  | 100       | 538       |
| 20180528  | 100       | 1         |
| 20180528  | 100       | 2         |
| 20180528  | 100       | 7         |
| 20180528  | 100       | 17        |
| 20180528  | 100       | 133       |
| 20180528  | 100       | 7         |
| 20180528  | 100       | 415       |
| 20180528  | 100       | 2         |
| 20180528  | 100       | 318       |
| 20180528  | 100       | 5         |
| 20180528  | 100       | 1         |
| 20180528  | 100       | 2060      |
| 20180528  | 100       | 1217      |
| 20180528  | 100       | 2         |
| 20180528  | 100       | 60        |
| 20180528  | 100       | 22        |
| 20180528  | 100       | 4         |
+-----------+-----------+-----------+--+
{code}
Actually sum of the deviceid is below:
{code}
0: jdbc:hive2://xxx/> select sum(t1.new_user) from (select `date`, 100 as 
platform, count(distinct deviceid) as new_user from tv.clean_new_user where 
`date`=20180528 group by `date`, platform)t1; 
+----------------+--+
| sum(new_user)  |
+----------------+--+
| 14816          |
+----------------+--+
1 row selected (4.934 seconds)
{code}
And the real distinct deviceid value is below:
{code}
0: jdbc:hive2://xxx/> select 100 as platform, count(distinct deviceid) as 
new_user from tv.clean_new_user where `date`=20180528;
+-----------+-----------+--+
| platform  | new_user  |
+-----------+-----------+--+
| 100       | 14773     |
+-----------+-----------+--+
1 row selected (2.846 seconds)
{code}

In impala,with the first query we can get result below:
{code}
[xxx] > select `date`, 100 as platform, count(distinct deviceid) as new_user 
from tv.clean_new_user where `date`=20180528 group by `date`, platform;Query: 
select `date`, 100 as platform, count(distinct deviceid) as new_user from 
tv.clean_new_user where `date`=20180528 group by `date`, platform
+----------+----------+----------+
| date     | platform | new_user |
+----------+----------+----------+
| 20180528 | 100      | 14773    |
+----------+----------+----------+
Fetched 1 row(s) in 1.00s
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to