[
https://issues.apache.org/jira/browse/SPARK-15372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15288457#comment-15288457
]
Reynold Xin commented on SPARK-15372:
-------------------------------------
[~freiss] you are saying this is not a problem right?
> TPC-DS Qury 84 returns wrong results against TPC official
> ---------------------------------------------------------
>
> Key: SPARK-15372
> URL: https://issues.apache.org/jira/browse/SPARK-15372
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Reporter: JESSE CHEN
> Assignee: Herman van Hovell
> Priority: Critical
> Labels: SPARK-15071
>
> The official TPC-DS query 84 returns wrong results when compared to its
> official answer set.
> The query itself is:
> {noformat}
> select c_customer_id as customer_id
> ,concat(c_last_name , ', ' , c_first_name) as customername
> from customer
> ,customer_address
> ,customer_demographics
> ,household_demographics
> ,income_band
> ,store_returns
> where ca_city = 'Edgewood'
> and c_current_addr_sk = ca_address_sk
> and ib_lower_bound >= 38128
> and ib_upper_bound <= 38128 + 50000
> and ib_income_band_sk = hd_income_band_sk
> and cd_demo_sk = c_current_cdemo_sk
> and hd_demo_sk = c_current_hdemo_sk
> and sr_cdemo_sk = cd_demo_sk
> order by c_customer_id
> limit 100;
> {noformat}
> Spark 2.0 build 0517 returned the following result:
> {noformat}
> AAAAAAAAAIPGAAAA Carter, Rodney
> AAAAAAAAAKMBBAAA Mcarthur, Emma
> AAAAAAAACBNHBAAA Wells, Ron
> AAAAAAAADBMEAAAA Vera, Tina
> AAAAAAAADBMEAAAA Vera, Tina
> AAAAAAAADHKGBAAA Scott, Pamela
> AAAAAAAAEIIBBAAA Atkins, Susan
> AAAAAAAAFKAHAAAA Batiste, Ernest
> AAAAAAAAGHMAAAAA Mitchell, Gregory
> AAAAAAAAIAODBAAA Murray, Karen
> AAAAAAAAIEOKAAAA Solomon, Clyde
> AAAAAAAAIIBOAAAA Owens, David
> AAAAAAAAIPDCAAAA Wallace, Eric
> AAAAAAAAIPIMAAAA Hayward, Benjamin
> AAAAAAAAJCIKAAAA Ramos, Donald
> AAAAAAAAKFJEAAAA Roberts, Yvonne
> AAAAAAAAKPGBBAAA NULL <------------------------ ??? questionable row
> AAAAAAAALCLABAAA Whitaker, Lettie
> AAAAAAAAMGMEAAAA Sharp, Michael
> AAAAAAAAMIGBBAAA Montgomery, Jesenia
> AAAAAAAAMPDKAAAA Lopez, Isabel
> AAAAAAAANEOMAAAA Powell, Linda
> AAAAAAAANKPCAAAA Shaffer, Sergio
> AAAAAAAANOCKAAAA Vargas, James
> AAAAAAAAOGJEBAAA Owens, Denice
> {noformat}
> Official answer set (which is correct!)
> {noformat}
> AAAAAAAAAIPGAAAA Carter , Rodney
> AAAAAAAAAKMBBAAA Mcarthur , Emma
> AAAAAAAACBNHBAAA Wells , Ron
> AAAAAAAADBMEAAAA Vera , Tina
> AAAAAAAADBMEAAAA Vera , Tina
> AAAAAAAADHKGBAAA Scott , Pamela
> AAAAAAAAEIIBBAAA Atkins , Susan
> AAAAAAAAFKAHAAAA Batiste , Ernest
> AAAAAAAAGHMAAAAA Mitchell , Gregory
> AAAAAAAAIAODBAAA Murray , Karen
> AAAAAAAAIEOKAAAA Solomon , Clyde
> AAAAAAAAIIBOAAAA Owens , David
> AAAAAAAAIPDCAAAA Wallace , Eric
> AAAAAAAAIPIMAAAA Hayward , Benjamin
> AAAAAAAAJCIKAAAA Ramos , Donald
> AAAAAAAAKFJEAAAA Roberts , Yvonne
> AAAAAAAAKPGBBAAA Moore ,
> AAAAAAAALCLABAAA Whitaker , Lettie
> AAAAAAAAMGMEAAAA Sharp , Michael
> AAAAAAAAMIGBBAAA Montgomery , Jesenia
> AAAAAAAAMPDKAAAA Lopez , Isabel
> AAAAAAAANEOMAAAA Powell , Linda
> AAAAAAAANKPCAAAA Shaffer , Sergio
> AAAAAAAANOCKAAAA Vargas , James
> AAAAAAAAOGJEBAAA Owens , Denice
> {noformat}
> The issue is with the "concat" function in Spark SQL (also behaves the same
> in Hive). When 'concat' meets any NULL string, it returns NULL as the answer.
> But is this right? When I concatenate a person's last name and first name, if
> the first name is missing (empty string or NULL), I should see the last name
> still, not NULL, i.e., "Smith" + "" = "Smith", not NULL.
> Simplest repeatable test:
> {noformat}
> hive> select c_first_name, c_last_name from customer where c_customer_id =
> 'AAAAAAAAKPGBBAAA';
> OK
> NULL Moore
> Time taken: 0.07 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_first_name) from customer where
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.1 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, c_first_name) from customer where
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.055 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_first_name) from customer where
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> NULL
> Time taken: 0.061 seconds, Fetched: 1 row(s)
> hive> select concat(c_last_name, ', ', c_customer_id) from customer where
> c_customer_id = 'AAAAAAAAKPGBBAAA';
> OK
> Moore, AAAAAAAAKPGBBAAA
> Same in 'spark-sql' shell:
> ...
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 45
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 46
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 47
> 16/05/17 15:57:10 INFO spark.ContextCleaner: Cleaned accumulator 48
> select concat(c_last_name, c_first_name) from customer where c_customer_id =
> 'AAAAAAAAKPGBBAAA';
> NULL
> Time taken: 0.296 seconds, Fetched 1 row(s)
> {noformat}
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]