kosiew opened a new pull request, #1246:
URL: https://github.com/apache/datafusion-python/pull/1246
## Which issue does this PR close?
* Closes #1245
## Rationale for this change
`SessionContext.read_table` previously required a `datafusion.catalog.Table`
(the Python `Table` wrapper) and forwarded its `.table` member into the Rust
binding. That meant objects that expose a `__datafusion_table_provider__()` API
returning a PyCapsule (a `TableProvider` exported via the FFI) could not be
passed directly to `read_table` and instead had to be registered in the catalog
first. This added an unnecessary registration round-trip and prevented
ergonomic use of PyCapsule-backed/custom table providers.
This PR makes `read_table` accept either a `datafusion.catalog.Table` or any
Python object that implements `__datafusion_table_provider__()` returning a
properly validated PyCapsule. The change removes the need to register a
provider just to obtain a `DataFrame` and unifies the behavior with other
places that already accept PyCapsule-backed providers.
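Conceptually, the duck-typed interface now accepted can be sketched as a Python `Protocol`. This is a hedged illustration only: `TableProviderExportable` is the name used in the typing changes below, but the exact definition in datafusion-python may differ.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TableProviderExportable(Protocol):
    """Sketch of the duck-typed interface: any object with this method
    can be passed to SessionContext.read_table directly."""

    def __datafusion_table_provider__(self) -> Any:  # returns a PyCapsule
        ...


class MyProvider:
    """A hypothetical custom provider; the method name alone qualifies it."""

    def __datafusion_table_provider__(self) -> Any:
        # A real provider would export an FFI_TableProvider as a PyCapsule
        # named "datafusion_table_provider"; omitted in this sketch.
        raise NotImplementedError


# Structural check: MyProvider satisfies the protocol, plain objects do not.
assert isinstance(MyProvider(), TableProviderExportable)
assert not isinstance(object(), TableProviderExportable)
```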
## What changes are included in this PR?
High-level summary of the changes applied across Python and Rust layers:
* **Python documentation**
* `docs/source/user-guide/io/table_provider.rst`: document
`SessionContext.read_table(provider)` usage.
* **Python bindings**
* `python/datafusion/catalog.py`
* Add `Table.__datafusion_table_provider__` to expose the underlying
PyCapsule from the Python `Table` wrapper so it can be treated as a
TableProvider-exportable object by other Python code.
* `python/datafusion/context.py`
* Update `SessionContext.read_table` typing and docstring to accept
either `Table` or a `TableProviderExportable` object (an object implementing
`__datafusion_table_provider__`).
* Adjust internal dispatch so both `Table` instances and provider
objects are supported.
* **Rust core**
* `src/utils.rs`
* Add `foreign_table_provider_from_capsule` and
`try_table_provider_from_object` helpers to centralize validation and
extraction of `FFI_TableProvider` from a PyCapsule and to convert it into an
`Arc<dyn TableProvider>`.
* `src/catalog.rs`
* Use `try_table_provider_from_object` to detect and accept provider
objects that expose `__datafusion_table_provider__` when registering tables
into the catalog.
* Add `PyTable::__datafusion_table_provider__` so `Table` can export an
`FFI_TableProvider` PyCapsule (this is what `python/datafusion/catalog.py`
calls through the Python layer).
* Simplify and reorganize provider extraction logic inside
`register_table` and schema provider lookup to prefer direct `PyTable`
extraction, then `try_table_provider_from_object`, then fallback to
constructing a `Dataset` as before.
* `src/context.rs`
* Update `PySessionContext::register_table` to accept PyCapsule-backed
provider objects by using `try_table_provider_from_object`.
* Update `PySessionContext::read_table` to accept a generic `PyAny`
bound and detect either `PyTable` (native, avoid FFI round-trip) or any object
that exposes `__datafusion_table_provider__`. Returns an error if neither
condition is met.
* `src/udtf.rs`
* Use `try_table_provider_from_object` when calling Python table
functions so UDTFs that return a provider object via
`__datafusion_table_provider__` are accepted.
* **Tests**
* `python/tests/test_catalog.py`
* Add `test_register_raw_table_without_capsule` to ensure raw `RawTable`
objects can be registered (monkeypatch ensures the capsule path is not
invoked), queried, and deregistered.
* `python/tests/test_context.py`
* Add `test_read_table_accepts_table_provider` to verify
`ctx.read_table(provider)` works when `provider` is a PyCapsule-backed object,
and that `ctx.read_table(table)` still works for regular `Table` objects.
* Minor import cleanup (moved the `uuid4` import to module level where
appropriate).
Other smaller maintenance changes: imports reorganized and some helper
functions added to centralize PyCapsule validation and conversion.
## Are these changes tested?
Yes — new unit tests have been added to validate the new behavior and to
guard against regressions:
* `test_read_table_accepts_table_provider` (in
`python/tests/test_context.py`) exercises reading from a registered provider
and from a provider object directly.
* `test_register_raw_table_without_capsule` (in
`python/tests/test_catalog.py`) verifies that the raw-table registration path
does not trigger the capsule-based extraction and that queries against the
registered table return the expected results.
Existing tests were left intact, and the new tests exercise both the Python-
and Rust-side changes.
## Are there any user-facing changes?
Yes — API behavior and documentation are updated:
* `SessionContext.read_table` now accepts either a
`datafusion.catalog.Table` or any object that implements
`__datafusion_table_provider__()` returning a PyCapsule named
`"datafusion_table_provider"`. Users can now call `ctx.read_table(provider)`
on provider objects without registering them first.
* New docs in `docs/source/user-guide/io/table_provider.rst` show the
direct-use pattern via `ctx.read_table(provider)`.
This is backwards-compatible: previously-accepted inputs (the Python `Table`
wrapper and `Dataset`-like objects) continue to work.
*No breaking changes were made to public function signatures on the Rust
side; the changes are additive, extending the accepted input types and
centralizing provider-extraction logic.*
## Notes / Caveats
* The capsule name used and validated is `"datafusion_table_provider"`.
Provider objects must implement `__datafusion_table_provider__()` that returns
a PyCapsule with that name.
* A `PyTable` (the Rust-side class backing the Python `Table` wrapper) still
exposes its provider via `__datafusion_table_provider__()`; however, the Rust
`read_table` path prefers direct `PyTable` extraction to avoid an unnecessary
FFI round-trip when the object is already a `RawTable`.
* The `FFI_TableProvider::new(..., Some(runtime))` call means the created
FFI wrapper captures a Tokio runtime handle; embedding applications should
keep a compatible runtime alive for as long as the provider is in use.
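For illustration, a dummy capsule carrying the required name can be created and checked from Python through the CPython C API. This is a sketch only: a real provider wraps a pointer to an `FFI_TableProvider`, not an arbitrary address.

```python
import ctypes

# Bind the CPython capsule functions via ctypes. PyCapsule_New returns a new
# reference; PyCapsule_IsValid checks both the capsule type and its name.
PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = (ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p)

PyCapsule_IsValid = ctypes.pythonapi.PyCapsule_IsValid
PyCapsule_IsValid.restype = ctypes.c_int
PyCapsule_IsValid.argtypes = (ctypes.py_object, ctypes.c_char_p)

NAME = b"datafusion_table_provider"

# Dummy non-NULL pointer and no destructor; a real provider would pass a
# pointer to an exported FFI_TableProvider here.
capsule = PyCapsule_New(0xDEADBEEF, NAME, None)

# The Rust side validates the capsule name before extracting the provider;
# a capsule with any other name is rejected.
assert PyCapsule_IsValid(capsule, NAME) == 1
assert PyCapsule_IsValid(capsule, b"wrong_name") == 0
```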
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]