Re: [PR] benchmark: Add parquet h2o support [datafusion]

via GitHub Wed, 16 Jul 2025 22:34:30 -0700


zhuqi-lucas commented on PR #16804:
URL: https://github.com/apache/datafusion/pull/16804#issuecomment-3082603389


   Updated. It works now: falsa has merged and released the fix:
   https://github.com/mrpowers-io/falsa/pull/28
   
   ```text
   ./bench.sh data h2o_small_join_parquet
   ***************************
   DataFusion Benchmark Runner and Data Generator
   COMMAND: data
   BENCHMARK: h2o_small_join_parquet
   DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
   CARGO_COMMAND: cargo run --release
   PREFER_HASH_JOIN: true
   ***************************
   Found Python version 3.13, which is suitable.
   Using Python command: /opt/homebrew/bin/python3
   Installing falsa...
   Generating h2o test data in /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o with size=SMALL and format=PARQUET
   10 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e1_0.parquet
   
   10000 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e4_0.parquet
   
   10000000 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e7_NA.parquet
   
   An SMALL data schema is the following:
   id1: int64 not null
   id4: string not null
   v2: double not null
   
   An output format is PARQUET
   
   Batch mode is supported.
   In case of memory problems you can try to reduce a batch_size.
   
   
   Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
   
   An MEDIUM data schema is the following:
   id1: int64 not null
   id2: int64 not null
   id4: string not null
   id5: string not null
   v2: double not null
   
   An output format is PARQUET
   
   Batch mode is supported.
   In case of memory problems you can try to reduce a batch_size.
   
   
   Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
   
   An BIG data schema is the following:
   id1: int64 not null
   id2: int64 not null
   id3: int64 not null
   id4: string not null
   id5: string not null
   id6: string not null
   v2: double not null
   
   An output format is PARQUET
   
   Batch mode is supported.
   In case of memory problems you can try to reduce a batch_size.
   
   
   Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:02
   
   An LSH data schema is the following:
   id1: int64 not null
   id2: int64 not null
   id3: int64 not null
   id4: string not null
   id5: string not null
   id6: string not null
   v1: double not null
   
   An output format is PARQUET
   
   Batch mode is supported.
   In case of memory problems you can try to reduce a batch_size.
   
   
   Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00
   ```
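
   For anyone who wants to sanity-check the generated files, the log above suggests that the second size token in each file name matches the row count (`1e1` for 10 rows, `1e4` for 10000, `1e7` for 10000000). A minimal sketch of that check follows. The helper name `expected_rows` and the file-name pattern are assumptions inferred only from the three names in this log, not a documented falsa convention:

   ```python
   import re
   from pathlib import Path

   # Assumed pattern, inferred only from the three file names in the log:
   # J1_<base size>_<row count>_<suffix>.parquet
   _NAME_RE = re.compile(r"^J1_(\d+e\d+)_(\d+e\d+)_(\w+)\.parquet$")


   def expected_rows(path: str) -> int:
       """Return the row count implied by the second size token of the file name."""
       match = _NAME_RE.match(Path(path).name)
       if match is None:
           raise ValueError(f"unrecognized h2o join file name: {path}")
       # Parse tokens like "1e4" with integer arithmetic to avoid float rounding.
       mantissa, exponent = match.group(2).split("e")
       return int(mantissa) * 10 ** int(exponent)


   if __name__ == "__main__":
       for name in (
           "J1_1e7_1e1_0.parquet",
           "J1_1e7_1e4_0.parquet",
           "J1_1e7_1e7_NA.parquet",
       ):
           print(name, expected_rows(name))
   ```

   The result could then be compared against the actual row count reported by whatever Parquet reader is at hand.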


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

