kazuyukitanimura opened a new pull request, #37096:
URL: https://github.com/apache/spark/pull/37096

   ### What changes were proposed in this pull request?
   `GenTPCDSData` uses the schema defined in `TPCDSSchema`, which contains 
`char(N)` columns. When `GenTPCDSData` generates Parquet files, strings shorter 
than `N` are padded with trailing spaces.
   
   When `TPCDSQueryBenchmark` reads the Parquet data generated by 
`GenTPCDSData`, it uses the schema stored in the Parquet files and therefore 
keeps the padding. Because of the extra spaces, the string filter queries of 
TPC-DS fail to match. For example, `q13` returns all nulls and finishes too 
quickly because its string filters match no rows.
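   The mismatch described above can be sketched without Spark. This is a 
minimal, hypothetical illustration (the width `N`, the `pad_char` helper, and 
the sample values are made up for the example, not taken from TPC-DS): a 
`char(N)` value stored with trailing-space padding no longer matches an 
unpadded filter literal under plain string equality.

   ```python
   # Hypothetical sketch of the char(N) padding mismatch; N and the sample
   # values are illustrative only.
   N = 10

   def pad_char(value: str, n: int = N) -> str:
       """Pad a string with trailing spaces to width n, as char(N) storage does."""
       return value.ljust(n)

   # Values as they end up in the generated Parquet files: space-padded.
   stored = [pad_char(v) for v in ["Home", "Mail", "Web"]]

   # The benchmark's filter literal is unpadded, so equality matches nothing.
   matched = [s for s in stored if s == "Home"]
   assert matched == []  # every row is filtered out -> wrong query results

   # Comparing with the padding ignored (which char(N)-aware reading restores)
   # finds the expected row.
   matched_trimmed = [s for s in stored if s.rstrip() == "Home"]
   assert len(matched_trimmed) == 1
   ```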
   
   This PR proposes to pass the schema definition to the table creation before 
reading, in order to fix the issue. This is similar to what the Spark TPC-DS 
unit tests do. In particular, this PR uses the `createTable(tableName, source, 
schema, options)` interface.
   
   The history of the `char` issue is discussed in 
https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn
   
   
   ### Why are the changes needed?
   Currently, `TPCDSQueryBenchmark` benchmarks with wrong query results, which 
makes the reported performance numbers inaccurate. As a result, the `Per 
Row(ns)` column of `TPCDSQueryBenchmark-results.txt` is `Infinity`. With this 
PR, the column shows real numbers.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Tested on GitHub Actions.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
