[ https://issues.apache.org/jira/browse/SPARK-13859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15200465#comment-15200465 ]

Dilip Biswal commented on SPARK-13859:
--------------------------------------

Hello,

I just checked the original spec for this query on the TPC-DS website. Here is the
template for Q38.

{code}
[_LIMITA] select [_LIMITB] count(*) from (
    select distinct c_last_name, c_first_name, d_date
    from store_sales, date_dim, customer
          where store_sales.ss_sold_date_sk = date_dim.d_date_sk
      and store_sales.ss_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from catalog_sales, date_dim, customer
          where catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
      and catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
  intersect
    select distinct c_last_name, c_first_name, d_date
    from web_sales, date_dim, customer
          where web_sales.ws_sold_date_sk = date_dim.d_date_sk
      and web_sales.ws_bill_customer_sk = customer.c_customer_sk
      and d_month_seq between [DMS] and [DMS] + 11
) hot_cust
[_LIMITC];
{code}

In this case the query in the spec uses the INTERSECT operator, and the implicitly 
generated comparisons for INTERSECT are null safe (two NULL values compare as equal). 
In other words, if we ran the query as-is from the spec, it would have worked.
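
To illustrate the difference, here is a minimal standalone sketch (the inline 
single-row subqueries are just for demonstration, not part of the benchmark 
schema). INTERSECT keeps a row whose key columns are NULL on both sides, while 
an inner join on "=" drops it:

{code}
-- INTERSECT treats NULLs as equal: this returns one row containing NULL.
SELECT * FROM (SELECT CAST(NULL AS STRING) AS name) t1
INTERSECT
SELECT * FROM (SELECT CAST(NULL AS STRING) AS name) t2;

-- An inner join on "=" drops the same row, because NULL = NULL
-- evaluates to NULL (not true), so the join condition never passes.
SELECT t1.name
FROM (SELECT CAST(NULL AS STRING) AS name) t1
JOIN (SELECT CAST(NULL AS STRING) AS name) t2 ON t1.name = t2.name;
{code}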

However, the query in this JIRA replaces INTERSECT with user-supplied join 
conditions that use "=". To my knowledge, the semantics of the equality operator 
in SQL are well defined: comparing a NULL with "=" yields NULL, so rows where 
c_last_name, c_first_name, or d_date is NULL never satisfy the join condition 
and are silently dropped, which would explain the missing rows. So I don't think 
this is a Spark SQL issue.
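
If the goal is to keep the JOIN form while preserving the INTERSECT semantics, 
one possible rewrite (a sketch, not tested against the benchmark) is to use 
Spark SQL's null-safe equality operator <=> in the join conditions, e.g. 
tmp1.c_last_name <=> tmp2.c_last_name:

{code}
-- <=> treats NULL <=> NULL as true, so this join keeps the all-NULL row
-- and matches the INTERSECT behavior shown above.
SELECT t1.name
FROM (SELECT CAST(NULL AS STRING) AS name) t1
JOIN (SELECT CAST(NULL AS STRING) AS name) t2 ON t1.name <=> t2.name;
{code}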

[~rxin] [~marmbrus] Please let us know your thoughts.



> TPCDS query 38 returns wrong results compared to TPC official result set 
> -------------------------------------------------------------------------
>
>                 Key: SPARK-13859
>                 URL: https://issues.apache.org/jira/browse/SPARK-13859
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.6.0
>            Reporter: JESSE CHEN
>              Labels: tpcds-result-mismatch
>
> Testing Spark SQL using TPC queries. Query 38 returns wrong results compared 
> to the official result set. This is at 1GB SF (validation run).
> SparkSQL returns a count of 0; the answer set reports 107.
> Actual results:
> {noformat}
> [0]
> {noformat}
> Expected:
> {noformat}
> +-----+
> |   1 |
> +-----+
> | 107 |
> +-----+
> {noformat}
> query used:
> {noformat}
> -- start query 38 in stream 0 using template query38.tpl and seed QUALIFICATION
>  select  count(*) from (
>     select distinct c_last_name, c_first_name, d_date
>     from store_sales
>          JOIN date_dim ON store_sales.ss_sold_date_sk = date_dim.d_date_sk
>          JOIN customer ON store_sales.ss_customer_sk = customer.c_customer_sk
>     where d_month_seq between 1200 and 1200 + 11) tmp1
>   JOIN
>     (select distinct c_last_name, c_first_name, d_date
>     from catalog_sales
>          JOIN date_dim ON catalog_sales.cs_sold_date_sk = date_dim.d_date_sk
>          JOIN customer ON catalog_sales.cs_bill_customer_sk = customer.c_customer_sk
>     where d_month_seq between 1200 and 1200 + 11) tmp2
>       ON (tmp1.c_last_name = tmp2.c_last_name)
>      and (tmp1.c_first_name = tmp2.c_first_name)
>      and (tmp1.d_date = tmp2.d_date)
>   JOIN
>     (
>     select distinct c_last_name, c_first_name, d_date
>     from web_sales
>          JOIN date_dim ON web_sales.ws_sold_date_sk = date_dim.d_date_sk
>          JOIN customer ON web_sales.ws_bill_customer_sk = customer.c_customer_sk
>     where d_month_seq between 1200 and 1200 + 11) tmp3
>       ON (tmp1.c_last_name = tmp3.c_last_name)
>      and (tmp1.c_first_name = tmp3.c_first_name)
>      and (tmp1.d_date = tmp3.d_date)
>   limit 100
>  ;
> -- end query 38 in stream 0 using template query38.tpl
> {noformat}


