[jira] [Commented] (SPARK-24440) When use constant as column we may get wrong answer versus impala

Marco Gaido (JIRA) Wed, 26 Sep 2018 04:25:13 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-24440?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628608#comment-16628608
 ]


Marco Gaido commented on SPARK-24440:
-------------------------------------

Can you provide a sample repro which can be run in order to debug the issue?

> When use constant as column we may get wrong answer versus impala
> -----------------------------------------------------------------
>
>                 Key: SPARK-24440
>                 URL: https://issues.apache.org/jira/browse/SPARK-24440
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0, 2.3.0
>            Reporter: zhoukang
>            Priority: Major
>
> For query below:
> {code:java}
> select `date`, 100 as platform, count(distinct deviceid) as new_user from 
> tv.clean_new_user where `date`=20180528 group by `date`, platform
> {code}
> We intended to group by 100 and get distinct deviceid number.
> By spark sql,we get:
> {code}
> +-----------+-----------+-----------+--+
> |   date    | platform  | new_user  |
> +-----------+-----------+-----------+--+
> | 20180528  | 100       | 521       |
> | 20180528  | 100       | 82        |
> | 20180528  | 100       | 3         |
> | 20180528  | 100       | 2         |
> | 20180528  | 100       | 7         |
> | 20180528  | 100       | 870       |
> | 20180528  | 100       | 3         |
> | 20180528  | 100       | 8         |
> | 20180528  | 100       | 3         |
> | 20180528  | 100       | 2204      |
> | 20180528  | 100       | 1123      |
> | 20180528  | 100       | 1         |
> | 20180528  | 100       | 54        |
> | 20180528  | 100       | 440       |
> | 20180528  | 100       | 4         |
> | 20180528  | 100       | 478       |
> | 20180528  | 100       | 34        |
> | 20180528  | 100       | 195       |
> | 20180528  | 100       | 17        |
> | 20180528  | 100       | 18        |
> | 20180528  | 100       | 2         |
> | 20180528  | 100       | 2         |
> | 20180528  | 100       | 84        |
> | 20180528  | 100       | 1616      |
> | 20180528  | 100       | 15        |
> | 20180528  | 100       | 7         |
> | 20180528  | 100       | 479       |
> | 20180528  | 100       | 50        |
> | 20180528  | 100       | 376       |
> | 20180528  | 100       | 21        |
> | 20180528  | 100       | 842       |
> | 20180528  | 100       | 444       |
> | 20180528  | 100       | 538       |
> | 20180528  | 100       | 1         |
> | 20180528  | 100       | 2         |
> | 20180528  | 100       | 7         |
> | 20180528  | 100       | 17        |
> | 20180528  | 100       | 133       |
> | 20180528  | 100       | 7         |
> | 20180528  | 100       | 415       |
> | 20180528  | 100       | 2         |
> | 20180528  | 100       | 318       |
> | 20180528  | 100       | 5         |
> | 20180528  | 100       | 1         |
> | 20180528  | 100       | 2060      |
> | 20180528  | 100       | 1217      |
> | 20180528  | 100       | 2         |
> | 20180528  | 100       | 60        |
> | 20180528  | 100       | 22        |
> | 20180528  | 100       | 4         |
> +-----------+-----------+-----------+--+
> {code}
> Actually sum of the deviceid is below:
> {code}
> 0: jdbc:hive2://xxx/> select sum(t1.new_user) from (select `date`, 100 as 
> platform, count(distinct deviceid) as new_user from tv.clean_new_user where 
> `date`=20180528 group by `date`, platform)t1; 
> +----------------+--+
> | sum(new_user)  |
> +----------------+--+
> | 14816          |
> +----------------+--+
> 1 row selected (4.934 seconds)
> {code}
> And the real distinct deviceid value is below:
> {code}
> 0: jdbc:hive2://xxx/> select 100 as platform, count(distinct deviceid) as 
> new_user from tv.clean_new_user where `date`=20180528;
> +-----------+-----------+--+
> | platform  | new_user  |
> +-----------+-----------+--+
> | 100       | 14773     |
> +-----------+-----------+--+
> 1 row selected (2.846 seconds)
> {code}
> In impala,with the first query we can get result below:
> {code}
> [xxx] > select `date`, 100 as platform, count(distinct deviceid) as new_user 
> from tv.clean_new_user where `date`=20180528 group by `date`, platform;Query: 
> select `date`, 100 as platform, count(distinct deviceid) as new_user from 
> tv.clean_new_user where `date`=20180528 group by `date`, platform
> +----------+----------+----------+
> | date     | platform | new_user |
> +----------+----------+----------+
> | 20180528 | 100      | 14773    |
> +----------+----------+----------+
> Fetched 1 row(s) in 1.00s
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-24440) When use constant as column we may get wrong answer versus impala

Reply via email to