kszucs opened a new pull request, #22648:
URL: https://github.com/apache/datafusion/pull/22648

   ## Which issue does this PR close?
   
   - Backport of #22632 to `branch-54`.
   
   ## Rationale for this change
   
   Content-defined chunking (CDC) write options were added in #21110 and are 
slated for the 54.0.0 release. This backports the refactor in #22632 so the 
config/proto surface ships in its final form, before the release goes out.
   
   The CDC options previously worked as `use_content_defined_chunking: 
Option<CdcOptions>` with a `ConfigField` impl that accepted a bare 
`use_content_defined_chunking = true|false` and otherwise enabled CDC 
implicitly when any sub-field was set. This has a few problems:
   
   - **Naming diverges from parquet-rs.** `WriterProperties` exposes 
`content_defined_chunking()` / 
`set_content_defined_chunking(Option<CdcOptions>)` with no `use_` prefix.
   - **Implicit / order-dependent on the SQL side.** Format options in `COPY 
... OPTIONS` / `CREATE EXTERNAL TABLE ... OPTIONS` are applied from a `HashMap` 
(non-deterministic order). With the old bare-boolean form, mixing `... = false` 
with a sub-field could resolve to enabled or disabled depending on iteration 
order.
   - **Extra machinery.** Supporting the bare boolean required hand-written 
`ConfigField` impls and a `#[expect(clippy::should_implement_trait)]` 
workaround, plus a zero-sentinel fallback in the proto mapping.
   
   Since CDC is unreleased, the config/proto surface can still be changed 
freely.
   
   ## What changes are included in this PR?
   
   - Rename the `ParquetOptions` field `use_content_defined_chunking` -> 
`content_defined_chunking` (matches parquet-rs).
   - Make `CdcOptions` a plain `config_namespace!` with an explicit `enabled: 
bool` field alongside the chunking parameters; the field is a bare `CdcOptions` 
(no longer `Option<CdcOptions>`). CDC is on iff 
`content_defined_chunking.enabled` is true. Setting a parameter no longer 
implicitly enables CDC, and the result is independent of key order.
   - Add `CdcOptions::enabled()` / `CdcOptions::disabled()` shorthand 
constructors.
   - Drop the `ConfigField` impls and the `should_implement_trait` workaround — 
all generated by the macro now.
   - Add an `enabled` field to the proto `CdcOptions` message so the proto <-> 
config mapping is a plain field copy in both directions.
   - Update unit tests, regenerate config docs + the `information_schema` 
snapshot, and add `parquet_cdc_config.slt` documenting the resolution behavior.
   
   ## Are these changes tested?
   
   Yes — `datafusion-common` config + writer unit tests, 
`datafusion-proto-common` proto round-trip tests, `datafusion/core` parquet 
integration tests, and sqllogictest (`parquet_cdc.slt` + new 
`parquet_cdc_config.slt`). Cherry-pick applied cleanly onto `branch-54`; 
affected crates build and the CDC unit tests pass.
   
   ## Are there any user-facing changes?
   
   Yes, but only to the unreleased CDC options:
   - Config key `datafusion.execution.parquet.use_content_defined_chunking` -> 
`datafusion.execution.parquet.content_defined_chunking.enabled` (plus 
`.min_chunk_size` / `.max_chunk_size` / `.norm_level`).
   - The bare-boolean form is removed; enable/disable via 
`content_defined_chunking.enabled = true|false`.
   
   No released API is affected.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to