Kazuyuki Tanimura created SPARK-39584:
-----------------------------------------
Summary: Fix TPCDSQueryBenchmark Measuring Performance of Wrong
Query Results
Key: SPARK-39584
URL: https://issues.apache.org/jira/browse/SPARK-39584
Project: Spark
Issue Type: Test
Components: Tests
Affects Versions: 3.3.0, 3.2.1, 3.1.2, 3.0.3, 3.4.0
Reporter: Kazuyuki Tanimura
GenTPCDSData uses the schema defined in `TPCDSSchema`, which contains
varchar(N)/char(N) columns. When GenTPCDSData generates parquet, strings
shorter than N are padded with trailing spaces.
When TPCDSQueryBenchmark reads the parquet data generated by GenTPCDSData, it
uses the schema from the parquet file and keeps the padding. Because of the
extra spaces, string filters in TPC-DS queries fail to match. For example, q13
returns all nulls and finishes too quickly because its string filter matches
no rows.
Therefore, TPCDSQueryBenchmark is benchmarking queries that return wrong
results, which inflates some performance results.
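The mismatch can be reproduced without Spark. The following is a minimal sketch in plain Python, assuming a simplified model in which a char(N) write right-pads the value to N characters. The helper `write_char`, the column width of 20, and the sample values are illustrative assumptions, not taken from GenTPCDSData.

```python
def write_char(value: str, n: int) -> str:
    """Simplified model of a char(n) write: right-pad with spaces to length n."""
    return value.ljust(n)


# Hypothetical values for a char(20) column, as they would be stored after padding.
stored = [write_char(v, 20) for v in ["Advanced Degree", "College", "Advanced Degree"]]

# Reading the values back as plain strings keeps the padding,
# so an equality filter against the unpadded literal matches nothing.
naive_matches = [v for v in stored if v == "Advanced Degree"]

# Stripping the trailing spaces before comparing restores the expected matches.
trimmed_matches = [v for v in stored if v.rstrip(" ") == "Advanced Degree"]

print(len(naive_matches), len(trimmed_matches))
```

A filter that matches no rows leaves the rest of the query with almost no work to do, which is consistent with the suspiciously short q13 runtime described above.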
I am exploring two possible solutions:
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before
reading. This is what the Spark unit tests do.
2. Change varchar to string in the schema. This is what the Databricks data
generator (https://github.com/databricks/spark-sql-perf) does.
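Both options can be sketched as small string helpers. This is only an illustration under stated assumptions: `create_table_sql` and `to_string_schema` are hypothetical names, the schema is assumed to be a comma-separated DDL string, and the table name and path in the usage lines are made up. It is not the change that was made in Spark.

```python
import re

# Matches char(N) and varchar(N) type names, case-insensitively.
_CHAR_TYPE = re.compile(r"\b(?:var)?char\(\d+\)", re.IGNORECASE)


def create_table_sql(table: str, schema_ddl: str, path: str) -> str:
    """Option 1: build a CREATE TABLE statement that declares the schema explicitly."""
    return f"CREATE TABLE `{table}` ({schema_ddl}) USING parquet LOCATION '{path}'"


def to_string_schema(schema_ddl: str) -> str:
    """Option 2: replace every char(N)/varchar(N) type with string."""
    return _CHAR_TYPE.sub("string", schema_ddl)


# Hypothetical schema fragment and location, for illustration only.
schema = "cd_demo_sk int, cd_gender char(1), cd_education_status varchar(20)"
print(create_table_sql("customer_demographics", schema, "/tmp/tpcds/customer_demographics"))
print(to_string_schema(schema))
```

Option 1 keeps the declared column types and lets the table definition drive how the data is read, while option 2 removes the length-bound types from the schema altogether.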
TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in
https://issues.apache.org/jira/browse/SPARK-35192
Background on the history of varchar:
https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn