kszucs opened a new pull request, #22632:
URL: https://github.com/apache/datafusion/pull/22632

   ## Which issue does this PR close?
   
   - N/A — this refines the parquet content-defined chunking (CDC) write 
options added in #21110, which have not been released yet (slated for 54.0.0), 
before they ship. Happy to file a tracking issue if preferred.
   
   ## Rationale for this change
   
   The CDC options currently work as `use_content_defined_chunking: 
Option<CdcOptions>` with a bespoke `ConfigField` impl that accepts a bare 
`use_content_defined_chunking = true|false` and otherwise enables CDC 
implicitly when any sub-field is set. This has a few problems:
   
   - **Naming diverges from parquet-rs.** `WriterProperties` exposes 
`content_defined_chunking()` / 
`set_content_defined_chunking(Option<CdcOptions>)` with no `use_` prefix.
   - **Implicit / order-dependent on the SQL side.** Format options in `COPY 
... OPTIONS` / `CREATE EXTERNAL TABLE ... OPTIONS` are applied from a `HashMap` 
(non-deterministic order). With the old bare-boolean form, mixing `... = false` 
with a sub-field, or setting a sub-field after `= false`, could resolve to 
enabled or disabled depending on iteration order.
   - **Extra machinery.** Supporting the bare boolean required a hand-written 
`impl ConfigField for CdcOptions` + `impl ConfigField for Option<CdcOptions>` 
and a `#[expect(clippy::should_implement_trait)]` workaround, plus a 
zero-sentinel fallback in the proto mapping.
   
   Since CDC is unreleased, the config/proto surface can still be changed 
freely.
   
   ## What changes are included in this PR?
   
   - Rename the `ParquetOptions` field `use_content_defined_chunking` -> 
`content_defined_chunking` (matches parquet-rs).
   - Make `CdcOptions` a plain `config_namespace!` with an explicit `enabled: 
bool` field alongside the chunking parameters; the field is a bare `CdcOptions` 
(no longer `Option<CdcOptions>`). CDC is on iff 
`content_defined_chunking.enabled` is true. Setting a parameter no longer 
implicitly enables CDC, and the result is independent of key order.
   - Add `CdcOptions::enabled()` / `CdcOptions::disabled()` shorthand 
constructors.
   - Drop the bespoke `ConfigField` impls and the `should_implement_trait` 
workaround — all generated by the macro now.
   - Add an `enabled` field to the proto `CdcOptions` message so the proto <-> 
config mapping is a plain field copy in both directions (removes the 
presence-encoding and the zero-sentinel fallback).
   - Update unit tests, regenerate config docs + the `information_schema` 
snapshot, and add `parquet_cdc_config.slt` documenting the resolution behavior.
   
   ## Are these changes tested?
   
   Yes:
   - `datafusion-common` config + writer unit tests (enable toggle, 
parameter-does-not-enable, validation, writer round-trip).
   - `datafusion-proto-common` proto round-trip tests (enabled / disabled / 
negative norm level).
   - `datafusion/core` parquet integration tests (data round-trip, page 
boundaries).
   - sqllogictest: `parquet_cdc.slt` (end-to-end) and a new 
`parquet_cdc_config.slt` (config resolution / order independence).
   
   ## Are there any user-facing changes?
   
   Yes, but only to the unreleased CDC options:
   - Config key `datafusion.execution.parquet.use_content_defined_chunking` -> 
`datafusion.execution.parquet.content_defined_chunking.enabled` (plus 
`.min_chunk_size` / `.max_chunk_size` / `.norm_level`).
   - The bare-boolean form is removed; enable/disable via 
`content_defined_chunking.enabled = true|false`.
   
   No released API is affected.
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to