nssalian opened a new issue, #3100: URL: https://github.com/apache/iceberg-python/issues/3100
### Feature Request / Improvement

## Problem

The write path in `pyiceberg/io/pyarrow.py` is hardcoded to Parquet. The `write.format.default` table property exists but is never read. Adding a new format (ORC, Vortex, Lance) requires modifying the monolithic `write_file()` function. The read path already dispatches multiple formats; the write path should too.

## Proposal

Introduce a File Format API aligned with Java Iceberg's [File Format API](https://github.com/apache/iceberg/pull/12774) ([design doc](https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds)).

New module `pyiceberg/io/fileformat.py`:

- `FileFormatWriter` (ABC)
- `FileFormatModel` (ABC)
- `FormatRegistry`
- `DataFileStatistics` (currently in `pyarrow.py`, but I think it might be good to consolidate it here for metrics)

Changes to `pyiceberg/io/pyarrow.py`:

- `ParquetFormatWriter` / `ParquetFormatModel` built on `write_parquet()` (currently defined inside `write_file()`)
- `write_file()` refactored to read `write.format.default`, look up the format model, and dispatch

TCK `tests/io/test_file_format_tck.py`:

- pytest-parameterized round-trip, statistics, type coverage, and null handling tests for every registered format

Phased rollout:

- ABCs and registry first, then Parquet extraction with TCK tests, then `write_file()` dispatch

## Java ↔ Python Mapping

| Java | Python |
|---|---|
| `FormatModel<D, S>` | `FileFormatModel` (ABC, no type params) |
| `FileAppender<D>` / `ModelWriteBuilder` | `FileFormatWriter` (ABC) |
| `FormatModelRegistry` | `FormatRegistry` (keyed by `FileFormat` only) |
| `Metrics` | `DataFileStatistics` (existing) |
| TCK | `test_file_format_tck.py` |

## Scope

This proposal covers the abstraction layer and the Parquet extraction only. No new format writers are included; ORC write support ([#20](https://github.com/apache/iceberg-python/issues/20)) and any future formats (Avro, etc.) would be follow-ups once this lands.
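
To make the shape of the proposed abstraction concrete, here is a rough, self-contained sketch of how the two ABCs, the registry, and the `write_file()` dispatch might fit together. It does not import pyiceberg or pyarrow. All method names and signatures (`write`, `close`, `create_writer`, `register`, `get`), the `StubParquetModel` / `StubWriter` classes, and the local `FileFormat` and `DataFileStatistics` stand-ins are hypothetical and exist only for illustration; the real API would be settled in the implementation PRs. The fallback to `parquet` when the property is unset is also an assumption of this sketch.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, List, Mapping


class FileFormat(str, Enum):
    """Local stand-in for pyiceberg's file format enum (illustrative only)."""
    PARQUET = "parquet"
    ORC = "orc"


@dataclass
class DataFileStatistics:
    """Simplified stand-in; the real class carries far more metrics."""
    record_count: int


class FileFormatWriter(ABC):
    """Writes rows to one data file and reports statistics on close."""

    @abstractmethod
    def write(self, rows: List[Dict[str, Any]]) -> None: ...

    @abstractmethod
    def close(self) -> DataFileStatistics: ...


class FileFormatModel(ABC):
    """Factory for writers of a single file format."""

    @property
    @abstractmethod
    def format(self) -> FileFormat: ...

    @abstractmethod
    def create_writer(self, path: str, properties: Mapping[str, str]) -> FileFormatWriter: ...


class FormatRegistry:
    """Maps a FileFormat to its model (keyed by format only)."""

    def __init__(self) -> None:
        self._models: Dict[FileFormat, FileFormatModel] = {}

    def register(self, model: FileFormatModel) -> None:
        self._models[model.format] = model

    def get(self, file_format: FileFormat) -> FileFormatModel:
        try:
            return self._models[file_format]
        except KeyError:
            raise ValueError(f"No writer registered for format: {file_format.value}") from None


class StubWriter(FileFormatWriter):
    """Hypothetical writer that only counts rows instead of writing a file."""

    def __init__(self) -> None:
        self._count = 0

    def write(self, rows: List[Dict[str, Any]]) -> None:
        self._count += len(rows)

    def close(self) -> DataFileStatistics:
        return DataFileStatistics(record_count=self._count)


class StubParquetModel(FileFormatModel):
    """Hypothetical placeholder for the proposed ParquetFormatModel."""

    @property
    def format(self) -> FileFormat:
        return FileFormat.PARQUET

    def create_writer(self, path: str, properties: Mapping[str, str]) -> FileFormatWriter:
        return StubWriter()


def write_file(
    registry: FormatRegistry,
    table_properties: Mapping[str, str],
    path: str,
    rows: List[Dict[str, Any]],
) -> DataFileStatistics:
    """Read write.format.default, look up the format model, and dispatch."""
    # Assumption: fall back to Parquet when the property is not set.
    name = table_properties.get("write.format.default", "parquet").lower()
    model = registry.get(FileFormat(name))
    writer = model.create_writer(path, table_properties)
    writer.write(rows)
    return writer.close()


if __name__ == "__main__":
    registry = FormatRegistry()
    registry.register(StubParquetModel())
    stats = write_file(registry, {"write.format.default": "parquet"}, "memory://example", [{"id": 1}])
    print(stats)
```

With this shape, adding a format would mean registering one more `FileFormatModel` rather than editing `write_file()`, and a TCK could parametrize its tests over whatever the registry contains.
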
## References

- Java File Format API: [apache/iceberg#12774](https://github.com/apache/iceberg/pull/12774)
- Design doc: [Google Doc](https://docs.google.com/document/d/1sF_d4tFxJsZWsZFCyCL9ZE7YuI7-P3VrzMLIrrTIxds)
- Format impls: Parquet [#15253](https://github.com/apache/iceberg/pull/15253), ORC [#15255](https://github.com/apache/iceberg/pull/15255), Avro [#15254](https://github.com/apache/iceberg/pull/15254)
- TCK: [apache/iceberg#15415](https://github.com/apache/iceberg/issues/15415)
- Prior pyiceberg ORC work: [#20](https://github.com/apache/iceberg-python/issues/20), [#790](https://github.com/apache/iceberg-python/pull/790), [#2236](https://github.com/apache/iceberg-python/pull/2236)
