kylebarron commented on issue #1227: URL: https://github.com/apache/datafusion-python/issues/1227#issuecomment-3332072263
I think it'll be easiest to focus on input parameters first. You can either use the `arrow-rs` `pyarrow` feature flag or use my crate [`pyo3-arrow`](https://crates.io/crates/pyo3-arrow). I'm partial to `pyo3-arrow` because of [these issues](https://docs.rs/pyo3-arrow/latest/pyo3_arrow/#why-not-use-arrow-rss-python-integration). So, for example, if we ever have a place where we need a _column input_, then we need to use `pyo3-arrow` (`arrow` doesn't have a way to import a column).

> ### Input Parameters
> **SessionContext methods:**
>
> * `from_arrow_table(data: pa.Table)`

This can be updated to import a table via the PyCapsule Interface, without any breaking change. On the Python side the type hint can be updated to

```py
from typing import Protocol

class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(
        self,
        requested_schema: object | None = None
    ) -> object: ...
```

> * `create_dataframe(partitions: list[list[pa.RecordBatch]])`

Similarly, each `RecordBatch` can be imported via PyCapsules.

> * `register_csv(..., schema: pa.Schema)`
> * `register_parquet(..., schema: pa.Schema)`
> * `register_json(..., schema: pa.Schema)`

These can import a schema via the PyCapsule Interface, without any breaking change. On the Python side the type hint can be updated to [this protocol](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.html#protocol-typehints)

```py
from typing import Protocol

class ArrowSchemaExportable(Protocol):
    def __arrow_c_schema__(self) -> object: ...
```

> * `register_dataset(dataset: pa.dataset.Dataset)`

This is specific to a pyarrow API, so it can stay as-is with pyarrow as an optional dependency. (I wish this was named `register_pyarrow_dataset` to avoid confusion.)

>
> **DataFrame methods:**
>
> * `cast(mapping: dict[str, pa.DataType])`

Why isn't this a schema? Or is it intentional that you only want to cast a couple of specified columns and leave the others alone, rather than "projecting" to the specific schema?
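Going back to the two protocols above, here is a rough sketch of how the Python side could tell stream-like and schema-like inputs apart without importing pyarrow. Everything except the two dunder method names is an assumption for illustration: `describe_arrow_input`, `FakeTable`, and `FakeSchema` are hypothetical and not part of datafusion-python, and the fake classes return a placeholder object where a real producer would return a PyCapsule.

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(
        self, requested_schema: Optional[object] = None
    ) -> object: ...


@runtime_checkable
class ArrowSchemaExportable(Protocol):
    def __arrow_c_schema__(self) -> object: ...


def describe_arrow_input(obj: object) -> str:
    """Hypothetical helper: report which PyCapsule export method `obj` offers.

    runtime_checkable protocols only check that the method exists,
    not that it returns a valid capsule.
    """
    if isinstance(obj, ArrowStreamExportable):
        return "stream"
    if isinstance(obj, ArrowSchemaExportable):
        return "schema"
    raise TypeError(
        f"{type(obj).__name__} does not implement the Arrow PyCapsule Interface"
    )


class FakeTable:
    """Stand-in for any table-like object (pyarrow, polars, ...)."""

    def __arrow_c_stream__(self, requested_schema=None):
        # Placeholder: a real producer returns a PyCapsule here.
        return object()


class FakeSchema:
    """Stand-in for any schema-like object."""

    def __arrow_c_schema__(self):
        # Placeholder: a real producer returns a PyCapsule here.
        return object()


print(describe_arrow_input(FakeTable()))
print(describe_arrow_input(FakeSchema()))
```

The point is that the check is structural: any library whose objects expose `__arrow_c_stream__` or `__arrow_c_schema__` would be accepted, which is why the type hints can be widened without a breaking change for existing pyarrow users.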
