[
https://issues.apache.org/jira/browse/SPARK-36926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gengliang Wang reassigned SPARK-36926:
--------------------------------------
Assignee: Wenchen Fan (was: Gengliang Wang)
> Discrepancy in Q22 of TPCH for Spark 3.2
> ----------------------------------------
>
> Key: SPARK-36926
> URL: https://issues.apache.org/jira/browse/SPARK-36926
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 3.2.0
> Reporter: Aravind Patnam
> Assignee: Wenchen Fan
> Priority: Blocker
> Fix For: 3.2.0
>
>
> When running TPCH scale 100 against 3.2, Query 22 has a discrepancy in the
> number of rows returned by the query. This was tested with both AQE on and
> off. All the other queries were matching in results. Below is the results
> that we got when testing Q22 on 3.2:
>
> {code:java}
> "results": [
> {
> "name": "Q22",
> "mode": "collect",
> "parameters": {},
> "joinTypes": [
> "SortMergeJoin"
> ],
> "tables": [
> "customer"
> ],
> "parsingTime": 0.016522,
> "analysisTime": 0.004132,
> "optimizationTime": 39.173868,
> "planningTime": 23.10939,
> "executionTime": 13762.183844,
> "result": 0,
> "breakDown": [],
> "queryExecution": "== Parsed Logical Plan ==\n'Sort ['cntrycode ASC
> NULLS FIRST], true\n+- 'Aggregate ['cntrycode], ['cntrycode, 'count(1) AS
> numcust#150, 'sum('c_acctbal) AS totacctbal#151]\n +- 'SubqueryAlias
> custsale\n +- 'Project ['substring('c_phone, 1, 2) AS cntrycode#147,
> 'c_acctbal]\n +- 'Filter (('substring('c_phone, 1, 2) IN
> (13,31,23,29,30,18,17) AND ('c_acctbal > scalar-subquery#148 [])) AND NOT
> exists#149 [])\n : :- 'Project [unresolvedalias('avg('c_acctbal),
> None)]\n : : +- 'Filter (('c_acctbal > 0.00) AND
> 'substring('c_phone, 1, 2) IN (13,31,23,29,30,18,17))\n : :
> +- 'UnresolvedRelation [customer], [], false\n : +- 'Project
> [*]\n : +- 'Filter ('o_custkey = 'c_custkey)\n :
> +- 'UnresolvedRelation [orders], [], false\n +-
> 'UnresolvedRelation [customer], [], false\n\n== Analyzed Logical Plan
> ==\ncntrycode: string, numcust: bigint, totacctbal: decimal(22,2)\nSort
> [cntrycode#147 ASC NULLS FIRST], true\n+- Aggregate [cntrycode#147],
> [cntrycode#147, count(1) AS numcust#150L, sum(c_acctbal#11) AS
> totacctbal#151]\n +- SubqueryAlias custsale\n +- Project
> [substring(c_phone#10, 1, 2) AS cntrycode#147, c_acctbal#11]\n +-
> Filter ((substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17) AND
> (cast(c_acctbal#11 as decimal(16,6)) > cast(scalar-subquery#148 [] as
> decimal(16,6)))) AND NOT exists#149 [c_custkey#6L])\n : :-
> Aggregate [avg(c_acctbal#160) AS avg(c_acctbal)#154]\n : : +-
> Filter ((cast(c_acctbal#160 as decimal(12,2)) > cast(0.00 as decimal(12,2)))
> AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17))\n : :
> +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n :
> : +- Relation
> tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
> orc\n : +- Project [o_orderkey#16L, o_custkey#17L,
> o_orderstatus#18, o_totalprice#19, o_orderpriority#20, o_clerk#21,
> o_shippriority#22, o_comment#23, o_orderdate#24]\n : +- Filter
> (o_custkey#17L = outer(c_custkey#6L))\n : +- SubqueryAlias
> spark_catalog.tpch_data_orc_100.orders\n : +- Relation
> tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
> orc\n +- SubqueryAlias spark_catalog.tpch_data_orc_100.customer\n
> +- Relation
> tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
> orc\n\n== Optimized Logical Plan ==\nSort [cntrycode#147 ASC NULLS FIRST],
> true\n+- Aggregate [cntrycode#147], [cntrycode#147, count(1) AS numcust#150L,
> sum(c_acctbal#11) AS totacctbal#151]\n +- Project [substring(c_phone#10, 1,
> 2) AS cntrycode#147, c_acctbal#11]\n +- Join LeftAnti, (o_custkey#17L =
> c_custkey#6L)\n :- Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n
> : +- Filter ((isnotnull(c_acctbal#11) AND substring(c_phone#10, 1,
> 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11 as decimal(16,6)) >
> scalar-subquery#148 []))\n : : +- Aggregate [avg(c_acctbal#160)
> AS avg(c_acctbal)#154]\n : : +- Project [c_acctbal#160]\n
> : : +- Filter (isnotnull(c_acctbal#160) AND ((c_acctbal#160 >
> 0.00) AND substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17)))\n
> : : +- Relation
> tpch_data_orc_100.customer[c_custkey#155L,c_name#156,c_address#157,c_nationkey#158L,c_phone#159,c_acctbal#160,c_comment#161,c_mktsegment#162]
> orc\n : +- Relation
> tpch_data_orc_100.customer[c_custkey#6L,c_name#7,c_address#8,c_nationkey#9L,c_phone#10,c_acctbal#11,c_comment#12,c_mktsegment#13]
> orc\n +- Project [o_custkey#17L]\n +- Relation
> tpch_data_orc_100.orders[o_orderkey#16L,o_custkey#17L,o_orderstatus#18,o_totalprice#19,o_orderpriority#20,o_clerk#21,o_shippriority#22,o_comment#23,o_orderdate#24]
> orc\n\n== Physical Plan ==\n*(7) Sort [cntrycode#147 ASC NULLS FIRST], true,
> 0\n+- Exchange rangepartitioning(cntrycode#147 ASC NULLS FIRST, 200),
> ENSURE_REQUIREMENTS, [id=#314]\n +- *(6)
> HashAggregate(keys=[cntrycode#147], functions=[count(1), sum(c_acctbal#11)],
> output=[cntrycode#147, numcust#150L, totacctbal#151])\n +- Exchange
> hashpartitioning(cntrycode#147, 200), ENSURE_REQUIREMENTS, [id=#310]\n
> +- *(5) HashAggregate(keys=[cntrycode#147], functions=[partial_count(1),
> partial_sum(c_acctbal#11)], output=[cntrycode#147, count#170L, sum#171,
> isEmpty#172])\n +- *(5) Project [substring(c_phone#10, 1, 2) AS
> cntrycode#147, c_acctbal#11]\n +- *(5) SortMergeJoin
> [c_custkey#6L], [o_custkey#17L], LeftAnti\n :- *(2) Sort
> [c_custkey#6L ASC NULLS FIRST], false, 0\n : +- Exchange
> hashpartitioning(c_custkey#6L, 200), ENSURE_REQUIREMENTS, [id=#292]\n
> : +- *(1) Project [c_custkey#6L, c_phone#10, c_acctbal#11]\n
> : +- *(1) Filter ((isnotnull(c_acctbal#11) AND
> substring(c_phone#10, 1, 2) IN (13,31,23,29,30,18,17)) AND (cast(c_acctbal#11
> as decimal(16,6)) > Subquery scalar-subquery#148, [id=#249]))\n
> : : +- Subquery scalar-subquery#148, [id=#249]\n
> : : +- *(2) HashAggregate(keys=[],
> functions=[avg(c_acctbal#160)], output=[avg(c_acctbal)#154])\n
> : : +- Exchange SinglePartition, ENSURE_REQUIREMENTS,
> [id=#245]\n : : +- *(1)
> HashAggregate(keys=[], functions=[partial_avg(c_acctbal#160)],
> output=[sum#175, count#176L])\n : :
> +- *(1) Project [c_acctbal#160]\n : :
> +- *(1) Filter ((isnotnull(c_acctbal#160) AND (c_acctbal#160 > 0.00)) AND
> substring(c_phone#159, 1, 2) IN (13,31,23,29,30,18,17))\n :
> : +- *(1) ColumnarToRow\n :
> : +- FileScan orc
> tpch_data_orc_100.customer[c_phone#159,c_acctbal#160,c_mktsegment#162]
> Batched: true, DataFilters: [isnotnull(c_acctbal#160), (c_acctbal#160 >
> 0.00), substring(c_phone#159, 1, 2) IN (13,31,23,29,3..., Format: ORC,
> Location: CatalogFileIndex(0 paths)[], PartitionFilters: [], PushedFilters:
> [IsNotNull(c_acctbal), GreaterThan(c_acctbal,0.00)], ReadSchema:
> struct<c_phone:string,c_acctbal:decimal(12,2)>\n :
> +- *(1) ColumnarToRow\n : +- FileScan orc
> tpch_data_orc_100.customer[c_custkey#6L,c_phone#10,c_acctbal#11,c_mktsegment#13]
> Batched: true, DataFilters: [isnotnull(c_acctbal#11), substring(c_phone#10,
> 1, 2) IN (13,31,23,29,30,18,17)], Format: ORC, Location: CatalogFileIndex(0
> paths)[], PartitionFilters: [], PushedFilters: [IsNotNull(c_acctbal)],
> ReadSchema: struct<c_custkey:bigint,c_phone:string,c_acctbal:decimal(12,2)>\n
> +- *(4) Sort [o_custkey#17L ASC NULLS FIRST], false, 0\n
> +- Exchange hashpartitioning(o_custkey#17L, 200),
> ENSURE_REQUIREMENTS, [id=#301]\n +- *(3) Project
> [o_custkey#17L]\n +- *(3) ColumnarToRow\n
> +- FileScan orc
> tpch_data_orc_100.orders[o_custkey#17L,o_orderdate#24] Batched: true,
> DataFilters: [], Format: ORC, Location: CatalogFileIndex(0 paths)[],
> PartitionFilters: [], PushedFilters: [], ReadSchema:
> struct<o_custkey:bigint>\n"
> }
> ]
> }
> {code}
> The correct result should be 7 rows returned, not 0 as shown above. This can
> be confirmed by testing the same exact query against the hive table using
> Presto/Hive, which both return 7. Additionally, this link also shows that it
> should be 7
> [https://github.com/apache/impala/blob/master/testdata/workloads/tpch/queries/tpch-q22.test.]
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]