kszucs opened a new pull request, #2561:
URL: https://github.com/apache/iceberg-rust/pull/2561

   ## Which issue does this PR close?
   
   <!-- No tracking issue; surfaced while writing Iceberg tables to HuggingFace 
Hub with content-defined chunking. Happy to file one if preferred. -->
   
   - Closes #.
   
   ## What changes are included in this PR?
   
   `write.parquet.*` table properties were only honored on the DataFusion 
`INSERT INTO` path, via an inline content-defined-chunking (CDC) translation. 
Any code writing through the writer stack directly (`DataFileWriter` → 
`ParquetWriterBuilder`) silently used parquet-rs defaults.
   
   - Add `ParquetWriterBuilder::from_table_properties(&TableProperties, 
schema)`, which translates `write.parquet.*` settings into `WriterProperties`. 
It currently translates the content-defined-chunking keys 
(`write.parquet.content-defined-chunking.*`); other keys fall back to 
parquet-rs defaults and can be added to this single translation point later.
   - Add a chainable `with_match_mode` setter so the field match mode can be 
overridden (DataFusion needs name-based matching, since its Arrow batches carry 
no field-id metadata).
   - Refactor the DataFusion `insert_into` writer to build via 
`from_table_properties`, reusing the `TableProperties` it has already parsed 
instead of translating CDC options inline.
   
   Additive only: `new` and `new_with_match_mode` are unchanged; no breaking 
changes.
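
   As a rough illustration of the single translation point and the chainable
setter described above, here is a minimal, self-contained sketch. It does not
use the real iceberg-rust or parquet-rs types: `WriterSettings`, `CdcOptions`,
`MatchMode`, the key suffixes (`enabled`, `min-chunk-size`, `max-chunk-size`)
and the default sizes are all hypothetical stand-ins, since the description
only names the `write.parquet.content-defined-chunking.*` prefix.

```rust
use std::collections::HashMap;
use std::fmt::Display;
use std::str::FromStr;

// Prefix taken from the PR description; the suffixes used below are
// hypothetical and only serve the illustration.
const CDC_PREFIX: &str = "write.parquet.content-defined-chunking.";

/// Stand-in for the CDC settings a Parquet writer would receive.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CdcOptions {
    pub min_chunk_size: usize,
    pub max_chunk_size: usize,
}

/// Stand-in for the field match mode (by field id or by column name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MatchMode {
    FieldId,
    Name,
}

/// Stand-in for the writer configuration produced by the translation.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct WriterSettings {
    pub cdc: Option<CdcOptions>,
    pub match_mode: MatchMode,
}

/// Look up one CDC key by its suffix.
fn cdc_value<'a>(props: &'a HashMap<String, String>, suffix: &str) -> Option<&'a str> {
    props
        .get(&format!("{}{}", CDC_PREFIX, suffix))
        .map(String::as_str)
}

/// Parse a CDC key, falling back to `default` when the key is absent.
fn parse_or<T>(props: &HashMap<String, String>, suffix: &str, default: T) -> Result<T, String>
where
    T: FromStr,
    T::Err: Display,
{
    match cdc_value(props, suffix) {
        None => Ok(default),
        Some(raw) => raw
            .parse::<T>()
            .map_err(|e| format!("invalid value {:?} for {}{}: {}", raw, CDC_PREFIX, suffix, e)),
    }
}

impl WriterSettings {
    /// Single translation point: every `write.parquet.*` key handled by the
    /// writer is read here, so all write paths see the same settings.
    pub fn from_table_properties(props: &HashMap<String, String>) -> Result<Self, String> {
        // CDC stays off unless explicitly enabled.
        let cdc = if parse_or(props, "enabled", false)? {
            Some(CdcOptions {
                // Placeholder defaults, not the library's actual values.
                min_chunk_size: parse_or(props, "min-chunk-size", 256 * 1024)?,
                max_chunk_size: parse_or(props, "max-chunk-size", 1024 * 1024)?,
            })
        } else {
            None
        };
        Ok(Self { cdc, match_mode: MatchMode::FieldId })
    }

    /// Chainable override, mirroring the idea behind `with_match_mode`.
    pub fn with_match_mode(mut self, mode: MatchMode) -> Self {
        self.match_mode = mode;
        self
    }
}
```

   In this sketch a caller such as the DataFusion path would write
`WriterSettings::from_table_properties(&props)?.with_match_mode(MatchMode::Name)`,
while a direct writer-stack caller would keep the default match mode.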
   
   ## Are these changes tested?
   
   - Unit tests in `parquet_writer.rs`: CDC is off by default, CDC options 
are translated from properties, and an end-to-end test writes through the 
writer to the local filesystem and asserts that the `payload` column is split 
into multiple variable-sized data pages with CDC enabled and written as a 
single page without it.
   - Existing `test_insert_into*` DataFusion integration tests cover the 
refactored path (behaviorally unchanged).
   - A new HF-gated integration test (`hf_cdc_write_test`) writes a CDC parquet 
file to a HuggingFace bucket and verifies content-chunking on read-back. It is 
wired into the existing `ci_hf_cdc.yml` workflow and runs only when 
`HF_TOKEN` and `HF_BUCKET` are set.
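
   For readers unfamiliar with environment-gated tests, the gating can be
sketched roughly as follows. This is not the PR's test code: `HfTarget`,
`hf_target_from` and `hf_target_from_env` are hypothetical names, and treating
an empty variable as unset is an assumption made here for illustration.

```rust
/// Hypothetical bundle of the two settings the gated test needs.
#[derive(Debug, PartialEq, Eq)]
pub struct HfTarget {
    pub token: String,
    pub bucket: String,
}

/// Returns `Some` only when both `HF_TOKEN` and `HF_BUCKET` resolve to a
/// non-empty value. The lookup is injected so the logic can be exercised
/// without touching the process environment.
pub fn hf_target_from(lookup: impl Fn(&str) -> Option<String>) -> Option<HfTarget> {
    // Assumption: an empty or whitespace-only value counts as "not set".
    let non_empty = |key: &str| lookup(key).filter(|v| !v.trim().is_empty());
    Some(HfTarget {
        token: non_empty("HF_TOKEN")?,
        bucket: non_empty("HF_BUCKET")?,
    })
}

/// Reads the real environment; a gated test would return early on `None`.
pub fn hf_target_from_env() -> Option<HfTarget> {
    hf_target_from(|key| std::env::var(key).ok())
}
```

   A gated test body could then begin with
`let Some(target) = hf_target_from_env() else { return; };` so that it is a
no-op wherever the credentials are absent.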


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

