kszucs opened a new pull request, #2561: URL: https://github.com/apache/iceberg-rust/pull/2561
## Which issue does this PR close?

<!-- No tracking issue; surfaced while writing Iceberg tables to HuggingFace Hub with content-defined chunking. Happy to file one if preferred. -->

- Closes #.

## What changes are included in this PR?

`write.parquet.*` table properties were only honored on the DataFusion `INSERT INTO` path, via an inline content-defined-chunking (CDC) translation. Any code writing through the writer stack directly (`DataFileWriter` → `ParquetWriterBuilder`) silently used parquet-rs defaults.

- Add `ParquetWriterBuilder::from_table_properties(&TableProperties, schema)`, which translates `write.parquet.*` settings into `WriterProperties`. It currently translates the content-defined-chunking keys (`write.parquet.content-defined-chunking.*`); other keys fall back to parquet-rs defaults and can be added to this single translation point later.
- Add a chainable `with_match_mode` setter so the field match mode can be overridden (DataFusion needs name-based matching, since its Arrow batches carry no field-id metadata).
- Refactor the DataFusion `insert_into` writer to build via `from_table_properties`, reusing the `TableProperties` it has already parsed instead of translating CDC options inline.

Additive only: `new` and `new_with_match_mode` are unchanged; no breaking changes.

## Are these changes tested?

- Unit tests in `parquet_writer.rs`: CDC off by default, CDC options translated from properties, and an end-to-end test that writes through the writer to local FS and asserts the `payload` column is split into multiple variable-sized data pages with CDC, and a single page without.
- Existing `test_insert_into*` DataFusion integration tests cover the refactored path (behaviorally unchanged).
- A new HF-gated integration test (`hf_cdc_write_test`) writes a CDC parquet file to a HuggingFace bucket and verifies content-chunking on read-back, wired into the existing `ci_hf_cdc.yml` workflow. Runs only when `HF_TOKEN`/`HF_BUCKET` are set.
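
For reviewers who want the shape of the "single translation point" without opening the diff, here is a minimal, std-only sketch of the idea. It is not the code in this PR. `CdcSettings`, `WriterSettings`, and `writer_settings_from_properties` are hypothetical stand-ins for the real `TableProperties` and `WriterProperties` types. The sub-keys `enabled`, `min-chunk-size`, and `max-chunk-size` and the default sizes are assumptions too, since the description only names the `write.parquet.content-defined-chunking.*` prefix.

```rust
use std::collections::HashMap;

// Prefix taken from the PR description; the sub-keys appended below are assumed.
const CDC_PREFIX: &str = "write.parquet.content-defined-chunking.";

/// Hypothetical stand-in for the writer-side CDC options (not a parquet-rs type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CdcSettings {
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
}

/// Hypothetical stand-in for `WriterProperties`; `cdc == None` means
/// "leave the writer defaults untouched".
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct WriterSettings {
    pub cdc: Option<CdcSettings>,
}

/// Translate string table properties into writer settings in one place,
/// so every write path (direct writer stack or SQL insert) sees the same result.
pub fn writer_settings_from_properties(
    props: &HashMap<String, String>,
) -> Result<WriterSettings, String> {
    // CDC stays off unless the (assumed) `enabled` key is explicitly "true".
    let enabled = match props.get(&format!("{}{}", CDC_PREFIX, "enabled")) {
        Some(v) => v
            .parse::<bool>()
            .map_err(|e| format!("invalid value for enabled: {}", e))?,
        None => false,
    };
    if !enabled {
        return Ok(WriterSettings::default());
    }

    // Read an optional size key, falling back to an illustrative default.
    let parse_size = |suffix: &str, default: usize| -> Result<usize, String> {
        match props.get(&format!("{}{}", CDC_PREFIX, suffix)) {
            Some(v) => v
                .parse::<usize>()
                .map_err(|e| format!("invalid value for {}: {}", suffix, e)),
            None => Ok(default),
        }
    };

    let min_chunk_size = parse_size("min-chunk-size", 256 * 1024)?;
    let max_chunk_size = parse_size("max-chunk-size", 1024 * 1024)?;
    if min_chunk_size > max_chunk_size {
        return Err(format!(
            "min-chunk-size ({}) exceeds max-chunk-size ({})",
            min_chunk_size, max_chunk_size
        ));
    }

    Ok(WriterSettings {
        cdc: Some(CdcSettings {
            min_chunk_size,
            max_chunk_size,
        }),
    })
}
```

In this sketch an empty property map yields `WriterSettings::default()`, which mirrors the "CDC off by default" unit test above. Keys outside the CDC prefix are simply ignored, which mirrors the fall-back-to-defaults behavior described for untranslated `write.parquet.*` keys.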
