diegoQuinas commented on issue #21446: URL: https://github.com/apache/datafusion/issues/21446#issuecomment-4275049309
Hi @ariel-miculas, happy to pick this up if you're not planning to finish your `json-test-on-main` branch. Just let me know either way.

A couple of bugs I noticed in that branch, worth flagging regardless of who continues:

- `data_clickbench_2` wgets `hits.json` but the S3 path only serves `hits.json.gz` (the `.json` URL returns 404). The real file is ~24 GB gzipped.
- The Rust side uses `JsonReadOptions::default()` with no gzip configuration, so even with a corrected URL it couldn't decode the compressed stream.

My plan if I take it:

- Keep the `--format parquet|json` flag approach from your branch (consistent with `tpch`).
- Name the `bench.sh` variant `clickbench_json` (matches the `_partitioned` / `_sorted` style).
- Read `hits.json.gz` directly via `NdJsonReadOptions::default().file_compression_type(GZIP)`, which avoids needing ~100 GB of free disk for decompression.
- Exclude it from `data all` since it's the same rows as `clickbench_1` in a different format (the same call `clickbench_sorted` already makes).

Open to adjustments from @alamb or Ariel before I send a PR.
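To make the `--format parquet|json` idea concrete, here is a minimal, std-only sketch of how the flag value could map to an input file and a "needs gzip decoding" decision. The names `BenchFormat`, `data_file`, and `is_gzip` are hypothetical and are not taken from the branch. The `hits.parquet` file name is an assumption; only `hits.json.gz` comes from the discussion above. It deliberately avoids calling any DataFusion API.

```rust
use std::str::FromStr;

/// Hypothetical enum for the `--format parquet|json` flag (illustration only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum BenchFormat {
    Parquet,
    Json,
}

impl FromStr for BenchFormat {
    type Err = String;

    /// Parse the flag value case-insensitively; reject anything else.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "parquet" => Ok(BenchFormat::Parquet),
            "json" => Ok(BenchFormat::Json),
            other => Err(format!(
                "unsupported format '{other}', expected 'parquet' or 'json'"
            )),
        }
    }
}

impl BenchFormat {
    /// File the benchmark would read for this format.
    /// `hits.parquet` is an assumed name; `hits.json.gz` is the file
    /// mentioned above as the only JSON variant actually served.
    pub fn data_file(self) -> &'static str {
        match self {
            BenchFormat::Parquet => "hits.parquet",
            BenchFormat::Json => "hits.json.gz",
        }
    }

    /// Whether the reader must be configured for gzip decompression,
    /// i.e. the step the plan above adds for the JSON variant.
    pub fn is_gzip(self) -> bool {
        self.data_file().ends_with(".gz")
    }
}
```

With this shape, the JSON branch would be the only place that needs to set a compression type on the read options, and the Parquet path stays unchanged.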
