kazuyukitanimura opened a new pull request, #37096: URL: https://github.com/apache/spark/pull/37096
### What changes were proposed in this pull request?

`GenTPCDSData` uses the schema defined in `TPCDSSchema`, which contains `char(N)` columns. When `GenTPCDSData` generates parquet files, it pads strings shorter than `N` with trailing spaces. When `TPCDSQueryBenchmark` later reads the parquet data generated by `GenTPCDSData`, it uses the schema stored in the parquet files and keeps the padding. Because of the extra spaces, the string-filter predicates of the TPC-DS queries match no rows. For example, the `q13` query returns all nulls and finishes suspiciously fast because its string filter matches nothing.

This PR proposes to pass the schema definition to the table creation before reading, which fixes the issue. This is similar to what the Spark TPC-DS unit tests do. In particular, this PR uses the `createTable(tableName, source, schema, options)` interface.

History related to the `char` issue: https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn

### Why are the changes needed?

Currently, `TPCDSQueryBenchmark` benchmarks against wrong query results and therefore reports inaccurate performance numbers: the `Per Row(ns)` column of `TPCDSQueryBenchmark-results.txt` shows `Infinity`. With this PR, the column shows real numbers.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested on GitHub Actions.
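
The padding mismatch can be reproduced outside Spark. A minimal sketch in plain Scala (the column value and width below are illustrative, not taken from the PR):

```scala
// Illustrates why char(N)-padded values break equality filters:
// parquet written from a char(N) schema keeps the trailing spaces,
// so an equality comparison against the unpadded literal never matches.
object CharPaddingDemo {
  // Pad a value the way a char(n) column stores it.
  def charPad(value: String, n: Int): String = value.padTo(n, ' ')

  def main(args: Array[String]): Unit = {
    val stored  = charPad("Advanced Degree", 20) // value as read back from parquet
    val literal = "Advanced Degree"              // literal used in a TPC-DS predicate

    // The padded value no longer equals the unpadded literal,
    // so the string filter matches zero rows.
    println(stored == literal)      // false
    println(stored.trim == literal) // true
  }
}
```

Passing the `TPCDSSchema` definition (with its `char(N)` types) to `createTable` lets Spark apply its char-type read-side semantics instead of treating the padded parquet strings as opaque `string` values.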
