kylebarron commented on issue #1227:
URL: 
https://github.com/apache/datafusion-python/issues/1227#issuecomment-3332072263

   I think it'll be easiest to focus on input parameters first. You can either 
use the `arrow-rs` `pyarrow` feature flag or use my crate 
[`pyo3-arrow`](https://crates.io/crates/pyo3-arrow). I'm partial to 
`pyo3-arrow` because of [these 
issues](https://docs.rs/pyo3-arrow/latest/pyo3_arrow/#why-not-use-arrow-rss-python-integration).
 For example, if we ever have a place where we need a _column input_, we'd 
have to use `pyo3-arrow`, since `arrow` doesn't have a way to import a column.
   
   > ### Input Parameters
   > **SessionContext methods:**
   > 
   > * `from_arrow_table(data: pa.Table)`
   
   This can be updated to import a table via the PyCapsule Interface, without 
any breaking change. On the Python side, the type hint can be updated to:
   ```py
   from typing import Protocol
   
   class ArrowStreamExportable(Protocol):
       def __arrow_c_stream__(
           self,
           requested_schema: object | None = None
       ) -> object:
           ...
   ```
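   As a hedged sketch (the protocol matches the Arrow docs, but `FakeTable` is purely illustrative): marking the protocol `runtime_checkable` means any producer that implements `__arrow_c_stream__`, whether a pyarrow `Table`, a polars `DataFrame`, or anything else, passes an `isinstance` check with no pyarrow import at all:

```py
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(
        self,
        requested_schema: object | None = None,
    ) -> object: ...


# Stand-in producer; a real implementation would return an
# ArrowArrayStream PyCapsule from this method.
class FakeTable:
    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        return object()


assert isinstance(FakeTable(), ArrowStreamExportable)
assert not isinstance(object(), ArrowStreamExportable)
```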
   
   > * `create_dataframe(partitions: list[list[pa.RecordBatch]])`
   
   Similarly, each `RecordBatch` can be imported via PyCapsules.
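   For reference, the per-batch hook defined by the PyCapsule Interface is `__arrow_c_array__`, which returns a two-tuple of (schema capsule, array capsule). A sketch of the corresponding protocol (`FakeBatch` is illustrative only, not a datafusion-python type):

```py
from __future__ import annotations

from typing import Protocol, runtime_checkable


@runtime_checkable
class ArrowArrayExportable(Protocol):
    def __arrow_c_array__(
        self,
        requested_schema: object | None = None,
    ) -> tuple[object, object]: ...


# Stand-in for a RecordBatch-like producer; a real implementation
# returns (ArrowSchema PyCapsule, ArrowArray PyCapsule).
class FakeBatch:
    def __arrow_c_array__(
        self, requested_schema: object | None = None
    ) -> tuple[object, object]:
        return (object(), object())


assert isinstance(FakeBatch(), ArrowArrayExportable)
```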
   
   > * `register_csv(..., schema: pa.Schema)`
   > * `register_parquet(..., schema: pa.Schema)`
   > * `register_json(..., schema: pa.Schema)`
   
   These can import a schema via the PyCapsule Interface, without any breaking 
change. On the Python side, the type hint can be updated to [this 
protocol](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints):
   
   ```py
   from typing import Protocol
   
   class ArrowSchemaExportable(Protocol):
       def __arrow_c_schema__(self) -> object: ...
   ```
   
   
   > * `register_dataset(dataset: pa.dataset.Dataset)`
   
   This is specific to a pyarrow API, so it can stay as-is with pyarrow as an 
optional dependency. (I wish this were named `register_pyarrow_dataset` to 
avoid confusion.)
   
   > 
   > **DataFrame methods:**
   > 
   > * `cast(mapping: dict[str, pa.DataType])`
   
   Why isn't this a schema? Or is it intentional that you only want to cast a 
few specified columns, leaving the others alone, rather than "projecting" to 
the specified schema?
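   To illustrate the distinction with plain dicts (hypothetical column and type names, purely to show the semantics): a partial mapping only touches the listed columns, while casting to a full schema would also pin down, and potentially drop, the rest:

```py
existing = {"a": "int64", "b": "utf8", "c": "float32"}

# cast(mapping) semantics: only the listed columns change, others are untouched.
mapping = {"b": "large_utf8"}
after_cast = {**existing, **mapping}
assert after_cast == {"a": "int64", "b": "large_utf8", "c": "float32"}

# Schema-projection semantics: the result has exactly the schema's columns.
schema = {"a": "int64", "b": "large_utf8"}
after_project = dict(schema)
assert "c" not in after_project
```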
   

