Re: [PR] Add H2O.ai Database-like Ops benchmark to dfbench (groupby support) [datafusion]

via GitHub Sun, 05 Jan 2025 19:48:12 -0800


zhuqi-lucas commented on PR #13996:
URL: https://github.com/apache/datafusion/pull/13996#issuecomment-2572223582


   > Thank you. I tried it and hit an issue generating data; everything else looks good to me.
   > 
   > When I run `./bench.sh data h2o_medium` with Python 3.13:
   > 
   > ```
   > ...
   >    error: the configured Python interpreter version (3.13) is newer than PyO3's maximum supported version (3.12)
   >         = help: please check if an updated version of PyO3 is available. Current version: 0.20.3
   >         = help: set PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1 to suppress this check and build anyway using the stable ABI
   >       warning: build failed, waiting for other jobs to finish...
   >       💥 maturin failed
   > ...
   > ```
   > 
   > This error showed up; I think `falsa` does not support Python 3.13. Perhaps we can enforce python@3.12 for now to avoid the issue? In the future, maybe we can use a Docker image to generate the h2o dataset instead.
   
   Thank you @2010YOUY01 for the review. I fixed the issue; Python 3.13 is now also supported, as confirmed by this test run:
   
   ```text
   ./benchmarks/bench.sh data h2o_small
   ***************************
   DataFusion Benchmark Runner and Data Generator
   COMMAND: data
   BENCHMARK: h2o_small
   DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
   CARGO_COMMAND: cargo run --release
   PREFER_HASH_JOIN: true
   ***************************
   Found Python version 3.13, which is suitable.
   Using Python command: /usr/local/bin/python3
   Installing falsa...
   Generating h2o test data in /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o with size=SMALL and format=PARQUET
   10000000 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/G1_1e7_1e7_100_0.parquet
   
   An output data schema is the following:
   id1: string
   id2: string
   id3: string
   id4: int64
   id5: int64
   id6: int64
   v1: int64 not null
   v2: int64 not null
   v3: double not null
   
   An output format is PARQUET
   
   Batch mode is supported.
   In case of memory problems you can try to reduce a batch_size.
   
   
   Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:04
   ```
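
For reference, the PyO3 help text quoted above names a flag that skips the interpreter version check. Below is a minimal sketch in Python of how a build environment could set that flag only when it is needed. This is an illustration, not necessarily the change made in this PR: `build_env` and `PYO3_MAX_SUPPORTED` are hypothetical names, and the 3.12 limit is taken from the error message for PyO3 0.20.3.

```python
import os
import sys

# Assumption taken from the quoted error message: PyO3 0.20.3 accepts
# interpreters up to Python 3.12.
PYO3_MAX_SUPPORTED = (3, 12)


def build_env(python_version, base_env=None):
    """Return a copy of the environment to use when building a PyO3-based package.

    `python_version` is any sequence starting with (major, minor), for example
    `sys.version_info`. `base_env` defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    if tuple(python_version[:2]) > PYO3_MAX_SUPPORTED:
        # Flag named in the PyO3 help text: suppress the version check and
        # build against the stable ABI.
        env["PYO3_USE_ABI3_FORWARD_COMPATIBILITY"] = "1"
    return env


if __name__ == "__main__":
    env = build_env(sys.version_info)
    print(env.get("PYO3_USE_ABI3_FORWARD_COMPATIBILITY", "not set"))
```

The returned mapping could then be passed as the `env` argument of `subprocess.run` when invoking the package install step, so the flag does not leak into the rest of the shell session.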
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


