[PR] feat: support arrow pycapsule streams [iceberg-python]

via GitHub Sun, 31 May 2026 13:47:48 -0700


abnobdoss opened a new pull request, #3447:
URL: https://github.com/apache/iceberg-python/pull/3447


   Closes #2680
   Closes #1655
   
   # Rationale for this change
   
   PyIceberg is coupled to PyArrow at its read/write boundary: `append` / 
`overwrite` reject anything that isn't a `pa.Table` / `pa.RecordBatchReader`, 
and external Arrow consumers can't read a table/scan without `to_arrow()`. 
Users of other Arrow-native libraries (polars, arro3, nanoarrow, …) therefore 
have to convert to PyArrow explicitly.
   
   This PR adopts the [Arrow PyCapsule interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html) on both sides:
   
   - **Input (#2680):** `append` / `overwrite` accept any object implementing 
`__arrow_c_stream__`, in addition to PyArrow types.
   - **Output (#1655):** `Table` and `DataScan` implement `__arrow_c_stream__`, 
so they can be handed to any Arrow consumer.
   
   ```python
   import polars as pl
   
   df = pl.DataFrame(table.scan())     # read: a scan is an Arrow producer
   table.append(some_polars_frame)     # write: a polars/arro3/… frame is too
   ```
   
   Native PyArrow inputs are unchanged; any other producer is imported as a 
streaming `RecordBatchReader`, so streaming is preserved. PyArrow stays an 
internal write dependency; this only removes the requirement that the *caller* 
use PyArrow.
   
   One small writer-side adjustment falls out of this: bin-packing still 
prefers Arrow's logical `nbytes` estimate, but falls back to the referenced buffer 
size for Arrow view types like `string_view`. Current Polars exports can 
produce these types, and PyArrow cannot always size them with `nbytes`.
   
   **Not in scope:** `upsert` / `dynamic_partition_overwrite` still require a 
materialized `pa.Table` (they do random access / joins and don't accept a 
`RecordBatchReader` today). Passing a PyCapsule producer to `append` / `overwrite` 
on a **partitioned** table raises `NotImplementedError`. This is the same 
restriction that already applies to `pa.RecordBatchReader`: the producer is 
consumed as a reader, and streaming writes to partitioned tables aren't 
supported. A materialized `pa.Table` is unaffected.
   
   ## Are these changes tested?
   
   Yes. `tests/table/test_arrow_capsule.py` (runs under `make test`, no Docker) 
covers coercion-helper branches; `append` over all input forms (`pa.Table`, 
reader, single- and multi-batch PyCapsule producers); `overwrite` with a 
producer; the native-`pa.Table`-on-partitioned regression; `pa.table(table)` / 
`pa.table(table.scan())` round-trips plus filter/projection; and the 
`dst.append(src.scan())` round-trip. `tests/io/test_pyarrow.py` covers the 
`string_view` bin-packing fallback.
   
   ## Are there any user-facing changes?
   
   Yes, additive and backwards compatible. `append` / `overwrite` accept Arrow 
PyCapsule producers (`__arrow_c_stream__`), and `Table` / `DataScan` implement 
`__arrow_c_stream__` so they can be passed to any Arrow consumer (e.g. 
`pa.table(...)`, `polars.DataFrame(...)`). No change for existing PyArrow 
inputs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


