Smotrov opened a new pull request, #18954:
URL: https://github.com/apache/datafusion/pull/18954
## Which issue does this PR close?
Closes #18947
## Rationale for this change
Currently, DataFusion uses default compression levels when writing
compressed JSON and CSV files. For ZSTD, this means level 3, which prioritizes
speed over compression ratio. Users working with large datasets who want to
optimize for storage costs or network transfer have no way to increase the
compression level.
This is particularly important for cloud data lake scenarios where storage
and egress costs can be significant.
## What changes are included in this PR?
- Add `compression_level: Option<u32>` field to `JsonOptions` and
`CsvOptions` in `config.rs`
- Add `convert_async_writer_with_level()` method to `FileCompressionType`
(non-breaking API extension)
- Keep original `convert_async_writer()` as a convenience wrapper for
backward compatibility
- Update `JsonWriterOptions` and `CsvWriterOptions` with `compression_level`
field
- Update `ObjectWriterBuilder` to support compression level
- Update JSON and CSV sinks to pass compression level through the write
pipeline
- Update proto definitions and conversions for serialization support
- Fix unrelated unused import warning in `udf.rs` (conditional compilation
for debug-only imports)
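
The "non-breaking API extension" in the list above can be pictured as a new method that accepts an optional level, plus the old method delegating to it with `None`. The sketch below is only an illustration of that pattern under stated assumptions: `Codec`, `EncoderPlan`, `plan` and `plan_with_level` are hypothetical stand-ins, not DataFusion's `FileCompressionType` or its actual method bodies, and dropping the level for uncompressed output is an assumed behavior.

```rust
/// Hypothetical stand-in for `FileCompressionType`; not the DataFusion type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Codec {
    Uncompressed,
    Gzip,
    Zstd,
}

/// Hypothetical description of the encoder that would wrap a writer.
#[derive(Debug, PartialEq)]
pub struct EncoderPlan {
    pub codec: Codec,
    /// `None` means "let the codec pick its own default level".
    pub level: Option<u32>,
}

impl Codec {
    /// New entry point: callers may pass an explicit level.
    pub fn plan_with_level(self, level: Option<u32>) -> EncoderPlan {
        // Assumption: an uncompressed writer has no level to apply, so drop it.
        let level = match self {
            Codec::Uncompressed => None,
            _ => level,
        };
        EncoderPlan { codec: self, level }
    }

    /// Original entry point, kept with its old signature.
    /// It simply delegates with `None`, so existing callers are unaffected.
    pub fn plan(self) -> EncoderPlan {
        self.plan_with_level(None)
    }
}
```

Because the old method keeps its signature and only forwards to the new one, existing call sites compile unchanged while new call sites can opt in to a level.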
## Are these changes tested?
The changes follow the existing patterns used throughout the codebase. The
implementation was verified by:
- Building successfully with `cargo build`
- Running existing tests with `cargo test --package datafusion-proto`
- All 131 proto integration tests pass
## Are there any user-facing changes?
Yes, users can now specify a compression level when writing JSON/CSV files:
```rust
use datafusion::common::config::JsonOptions;
use datafusion::common::parsers::CompressionTypeVariant;
let json_opts = JsonOptions {
    compression: CompressionTypeVariant::ZSTD,
    compression_level: Some(9), // Higher compression
    ..Default::default()
};
```
**Supported compression levels:**
- ZSTD: 1-22 (default: 3)
- GZIP: 0-9 (default: 6)
- BZIP2: 1-9 (default: 9)
- XZ: 0-9 (default: 6)
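
As a rough illustration of how the ranges and defaults listed above might be applied to an `Option<u32>` level, here is a minimal standalone sketch. `Algo`, `level_range`, `default_level` and `resolve_level` are hypothetical names, and rejecting out-of-range values with an error is an assumption: this description does not say whether the PR validates, clamps, or passes levels through to the codec.

```rust
use std::ops::RangeInclusive;

/// Hypothetical codec enum used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Algo {
    Zstd,
    Gzip,
    Bzip2,
    Xz,
}

impl Algo {
    /// Accepted level range, copied from the list in this description.
    pub fn level_range(self) -> RangeInclusive<u32> {
        match self {
            Algo::Zstd => 1..=22,
            Algo::Gzip => 0..=9,
            Algo::Bzip2 => 1..=9,
            Algo::Xz => 0..=9,
        }
    }

    /// Default level, copied from the list in this description.
    pub fn default_level(self) -> u32 {
        match self {
            Algo::Zstd => 3,
            Algo::Gzip => 6,
            Algo::Bzip2 => 9,
            Algo::Xz => 6,
        }
    }
}

/// Resolve an optional user-supplied level to a concrete one.
/// `None` falls back to the default; out-of-range values are rejected
/// (an assumed policy, chosen here only to make the ranges concrete).
pub fn resolve_level(algo: Algo, requested: Option<u32>) -> Result<u32, String> {
    let range = algo.level_range();
    match requested {
        None => Ok(algo.default_level()),
        Some(level) if range.contains(&level) => Ok(level),
        Some(level) => Err(format!(
            "level {} is outside {}..={} for {:?}",
            level,
            range.start(),
            range.end(),
            algo
        )),
    }
}
```

With this sketch, `resolve_level(Algo::Zstd, Some(9))` yields `Ok(9)`, while `resolve_level(Algo::Bzip2, Some(0))` is rejected because 0 is below the listed BZIP2 minimum.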
**This is a non-breaking change** - the original `convert_async_writer()`
method signature is preserved for backward compatibility.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]