[jira] [Commented] (SPARK-15372) TPC-DS Qury 84 returns wrong results against TPC official

JESSE CHEN (JIRA) Wed, 18 May 2016 10:14:42 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15289356#comment-15289356
 ]


JESSE CHEN commented on SPARK-15372:
------------------------------------

[~freiss] I agree with you. 
In addition, in order to match TPC official result, we are allowed to use minor 
query modification for treatment of null strings (e.g., coalesce), so the 
following query now runs and returns the matching results to TPC:
{noformat}
  select  c_customer_id as customer_id
       ,concat(c_last_name , ', ' , coalesce(c_first_name,'')) as customername
 from customer
     ,customer_address
     ,customer_demographics
     ,household_demographics
     ,income_band
     ,store_returns
 where ca_city          =  'Edgewood'
   and c_current_addr_sk = ca_address_sk
   and ib_lower_bound   >=  38128
   and ib_upper_bound   <=  38128 + 50000
   and ib_income_band_sk = hd_income_band_sk
   and cd_demo_sk = c_current_cdemo_sk
   and hd_demo_sk = c_current_hdemo_sk
   and sr_cdemo_sk = cd_demo_sk
 order by c_customer_id
  limit 100;
{noformat}

> TPC-DS Qury 84 returns wrong results against TPC official
> ---------------------------------------------------------
>
>                 Key: SPARK-15372
>                 URL: https://issues.apache.org/jira/browse/SPARK-15372
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>            Reporter: JESSE CHEN
>            Assignee: Herman van Hovell
>            Priority: Critical
>              Labels: SPARK-15071
>
> The official TPC-DS query 84 returns wrong results when compared to its 
> official answer set.
> The query itself is:
> {noformat}
>   select  c_customer_id as customer_id
>        ,concat(c_last_name , ', ' , c_first_name) as customername
>  from customer
>      ,customer_address
>      ,customer_demographics
>      ,household_demographics
>      ,income_band
>      ,store_returns
>  where ca_city          =  'Edgewood'
>    and c_current_addr_sk = ca_address_sk
>    and ib_lower_bound   >=  38128
>    and ib_upper_bound   <=  38128 + 50000
>    and ib_income_band_sk = hd_income_band_sk
>    and cd_demo_sk = c_current_cdemo_sk
>    and hd_demo_sk = c_current_hdemo_sk
>    and sr_cdemo_sk = cd_demo_sk
>  order by c_customer_id
>   limit 100;
> {noformat}
> Spark 2.0 build 0517 returned the following result:
> {noformat}
> AAAAAAAAAIPGAAAA      Carter, Rodney
> AAAAAAAAAKMBBAAA      Mcarthur, Emma
> AAAAAAAACBNHBAAA      Wells, Ron
> AAAAAAAADBMEAAAA      Vera, Tina
> AAAAAAAADBMEAAAA      Vera, Tina
> AAAAAAAADHKGBAAA      Scott, Pamela
> AAAAAAAAEIIBBAAA      Atkins, Susan
> AAAAAAAAFKAHAAAA      Batiste, Ernest
> AAAAAAAAGHMAAAAA      Mitchell, Gregory
> AAAAAAAAIAODBAAA      Murray, Karen
> AAAAAAAAIEOKAAAA      Solomon, Clyde
> AAAAAAAAIIBOAAAA      Owens, David
> AAAAAAAAIPDCAAAA      Wallace, Eric
> AAAAAAAAIPIMAAAA      Hayward, Benjamin
> AAAAAAAAJCIKAAAA      Ramos, Donald
> AAAAAAAAKFJEAAAA      Roberts, Yvonne
> AAAAAAAAKPGBBAAA      NULL <------------------------ ??? questionable row
> AAAAAAAALCLABAAA      Whitaker, Lettie
> AAAAAAAAMGMEAAAA      Sharp, Michael
> AAAAAAAAMIGBBAAA      Montgomery, Jesenia
> AAAAAAAAMPDKAAAA      Lopez, Isabel
> AAAAAAAANEOMAAAA      Powell, Linda
> AAAAAAAANKPCAAAA      Shaffer, Sergio
> AAAAAAAANOCKAAAA      Vargas, James
> AAAAAAAAOGJEBAAA      Owens, Denice
> {noformat}
> Official answer set (which is correct!)
> {noformat}
> AAAAAAAAAIPGAAAA Carter                      , Rodney
> AAAAAAAAAKMBBAAA Mcarthur                    , Emma
> AAAAAAAACBNHBAAA Wells                               , Ron
> AAAAAAAADBMEAAAA Vera                        , Tina
> AAAAAAAADBMEAAAA Vera                        , Tina
> AAAAAAAADHKGBAAA Scott                               , Pamela
> AAAAAAAAEIIBBAAA Atkins                      , Susan
> AAAAAAAAFKAHAAAA Batiste                     , Ernest
> AAAAAAAAGHMAAAAA Mitchell                    , Gregory
> AAAAAAAAIAODBAAA Murray                      , Karen
> AAAAAAAAIEOKAAAA Solomon                     , Clyde
> AAAAAAAAIIBOAAAA Owens                               , David
> AAAAAAAAIPDCAAAA Wallace                     , Eric
> AAAAAAAAIPIMAAAA Hayward                     , Benjamin
> AAAAAAAAJCIKAAAA Ramos                               , Donald
> AAAAAAAAKFJEAAAA Roberts                     , Yvonne
> AAAAAAAAKPGBBAAA Moore                               ,
> AAAAAAAALCLABAAA Whitaker                    , Lettie
> AAAAAAAAMGMEAAAA Sharp                               , Michael
> AAAAAAAAMIGBBAAA Montgomery                  , Jesenia
> AAAAAAAAMPDKAAAA Lopez                               , Isabel
> AAAAAAAANEOMAAAA Powell                      , Linda
> AAAAAAAANKPCAAAA Shaffer                     , Sergio
> AAAAAAAANOCKAAAA Vargas                      , James
> AAAAAAAAOGJEBAAA Owens                               , Denice
> {noformat}
> The issue is with the "concat" function in Spark SQL (also behaves the same 
> in Hive). When 'concat' meets any NULL string, it returns NULL as the answer. 
> But is this right? When I concatenate a person's last name and first name, if 
> the first name is missing (empty string or NULL), I should see the last name 
> still, not NULL, i.e., "Smith" + "" = "Smith", not NULL. 
> Simplest repeatable test:
> {noformat}
> hive> select c_first_name, c_last_name from customer where c_customer_id = 
> 'AAAAAAAAKPGBBAAA';
> OK
> NULL Moore
> Time taken: 0.07 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_first_name) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.1 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, c_first_name) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.055 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_first_name) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.061 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_customer_id) from customer where 
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> Moore, AAAAAAAAKPGBBAAA
> Same in 'spark-sql' shell:
> ...
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
> select concat(c_last_name, c_first_name) from customer where c_customer_id = 
> 'AAAAAAAAKPGBBAAA';
> NULL
> Time taken: 0.296 seconds, Fetched 1 row(s)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-15372) TPC-DS Qury 84 returns wrong results against TPC official

Reply via email to