abnobdoss opened a new pull request, #3447: URL: https://github.com/apache/iceberg-python/pull/3447
<!-- Closes #2680 --> <!-- Closes #1655 --> Closes #2680 Closes #1655 # Rationale for this change PyIceberg is coupled to PyArrow at its read/write boundary: `append` / `overwrite` reject anything that isn't a `pa.Table` / `pa.RecordBatchReader`, and external Arrow consumers can't read a table/scan without `to_arrow()`. Users of other Arrow-native libraries (polars, arro3, nanoarrow, …) therefore have to convert to PyArrow explicitly. This PR adopts the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) on both sides: - **Input (#2680):** `append` / `overwrite` accept any object implementing `__arrow_c_stream__`, in addition to PyArrow types. - **Output (#1655):** `Table` and `DataScan` implement `__arrow_c_stream__`, so they can be handed to any Arrow consumer. ```python import polars as pl df = pl.DataFrame(table.scan()) # read: a scan is an Arrow producer table.append(some_polars_frame) # write: a polars/arro3/… frame is too ``` Native PyArrow inputs are unchanged; any other producer is imported as a streaming `RecordBatchReader`, so streaming is preserved. PyArrow stays an internal write dependency; this only removes the requirement that the *caller* use PyArrow. One small writer-side adjustment falls out of this: bin-packing still prefers Arrow's logical `nbytes` estimate, but falls back to referenced buffer size for Arrow view types like `string_view`, which current Polars exports can produce and PyArrow cannot always size with `nbytes`. **Not in scope:** `upsert` / `dynamic_partition_overwrite` still require a materialized `pa.Table` (they do random access / joins and don't accept a `RecordBatchReader` today). A PyCapsule producer to `append` / `overwrite` on a **partitioned** table raises `NotImplementedError`, the same restriction that already applies to `pa.RecordBatchReader`, since the producer is consumed as a reader and streaming writes to partitioned tables aren't supported. A materialized `pa.Table` is unaffected. ## Are these changes tested? Yes. `tests/table/test_arrow_capsule.py` (runs under `make test`, no Docker) covers coercion-helper branches; `append` over all input forms (`pa.Table`, reader, single- and multi-batch PyCapsule producers); `overwrite` with a producer; the native-`pa.Table`-on-partitioned regression; `pa.table(table)` / `pa.table(table.scan())` round-trips plus filter/projection; and the `dst.append(src.scan())` round-trip. `tests/io/test_pyarrow.py` covers the `string_view` bin-packing fallback. ## Are there any user-facing changes? Yes, additive and backwards compatible. `append` / `overwrite` accept Arrow PyCapsule producers (`__arrow_c_stream__`), and `Table` / `DataScan` implement `__arrow_c_stream__` so they can be passed to any Arrow consumer (e.g. `pa.table(...)`, `polars.DataFrame(...)`). No change for existing PyArrow inputs. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
