[PR] fix: generate integer keys instead of floats in TPC-DS data [datafusion-benchmarks]

via GitHub Thu, 09 Apr 2026 07:14:36 -0700


Dandandan opened a new pull request, #31:
URL: https://github.com/apache/datafusion-benchmarks/pull/31


   ## Summary
   
   - Fix `tpcdsgen.py` to cast DataFrame to the correct PyArrow target schema before writing Parquet, ensuring nullable integer columns (surrogate keys, quantities) are stored as `int32` instead of `double`/`float64`
   - Fix trailing pipe detection to work with both old and new dsdgen versions (v4.0.0 no longer adds a trailing `|` as field terminator)
   - Regenerate all SF1 parquet data with dsdgen v4.0.0 and zstd compression
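
One way to tolerate both dsdgen output styles is to compare the field count against the expected column count, rather than stripping a trailing `|` unconditionally (which would drop a genuinely empty last column in the new format). This is a minimal sketch under that assumption; `split_dsdgen_line` and `num_columns` are hypothetical names and are not taken from `tpcdsgen.py`.

```python
def split_dsdgen_line(line: str, num_columns: int) -> list[str]:
    """Split one pipe-delimited dsdgen row, tolerating an optional trailing '|'."""
    fields = line.rstrip("\r\n").split("|")
    # Older dsdgen output terminates every row with '|', which produces
    # exactly one extra empty field after splitting.
    if len(fields) == num_columns + 1 and fields[-1] == "":
        fields.pop()
    if len(fields) != num_columns:
        raise ValueError(f"expected {num_columns} fields, got {len(fields)}")
    return fields


# Old style (trailing terminator) and new style (no terminator) parse the same.
old_row = split_dsdgen_line("1|abc|2.5|\n", 3)
new_row = split_dsdgen_line("1|abc|2.5\n", 3)
```

Because the decision is based on the column count, a new-style row whose last column is empty (`1|abc|`) keeps its empty final field instead of losing it.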
   
   ## Background
   
   The previous data had ~100 columns across 15 tables stored as `double` that should be `int32` (e.g. `ss_sold_date_sk`, `cs_bill_customer_sk`, `ss_quantity`). This happened because DataFusion's CSV reader promotes nullable `int32` columns to `float64` when there are null values. The fix collects the result into a PyArrow table and casts to the declared schema before writing.
   
   ## Test plan
   
   - [x] Verified all 24 generated parquet files have zero `double` columns
   - [x] Verified null counts match between old and new data
   - [x] All files are under GitHub's 100 MB file size limit
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

