kosiew opened a new pull request, #1246:
URL: https://github.com/apache/datafusion-python/pull/1246

   ## Which issue does this PR close?
   
   * Closes #1245
   
   ## Rationale for this change
   
   `SessionContext.read_table` previously required a `datafusion.catalog.Table` 
(the Python `Table` wrapper) and forwarded its `.table` member into the Rust 
binding. That meant objects that expose a `__datafusion_table_provider__()` API 
returning a PyCapsule (a `TableProvider` exported via the FFI) could not be 
passed directly to `read_table` and instead had to be registered in the catalog 
first. This added an unnecessary registration round-trip and prevented 
ergonomic use of PyCapsule-backed/custom table providers.
   
   This PR makes `read_table` accept either a `datafusion.catalog.Table` or any 
Python object whose `__datafusion_table_provider__()` method returns a 
properly validated PyCapsule. The change removes the need to register a 
provider just to obtain a `DataFrame` and unifies the behavior with other 
APIs that already accept PyCapsule-backed providers.
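
   For illustration, the duck-typing contract `read_table` now honors can be 
sketched as a runtime-checkable protocol. This is a minimal sketch, not the 
library's actual code: `MyProvider` and `read_table_accepts` are hypothetical 
names introduced here only to show the shape of the contract.

   ```python
   from typing import Any, Protocol, runtime_checkable


   @runtime_checkable
   class TableProviderExportable(Protocol):
       """Structural type for objects exporting a table provider capsule."""

       def __datafusion_table_provider__(self) -> Any: ...


   class MyProvider:
       """Hypothetical provider; a real one returns a PyCapsule named
       "datafusion_table_provider" wrapping an FFI_TableProvider."""

       def __datafusion_table_provider__(self) -> Any:
           return object()  # stand-in for the real capsule


   def read_table_accepts(obj: Any) -> bool:
       # Structural isinstance only checks that the method is present.
       return isinstance(obj, TableProviderExportable)


   print(read_table_accepts(MyProvider()))   # True
   print(read_table_accepts("not a table"))  # False
   ```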
   
   ## What changes are included in this PR?
   
   High-level summary of the changes applied across Python and Rust layers:
   
   * **Python documentation**
   
     * `docs/source/user-guide/io/table_provider.rst`: document 
`SessionContext.read_table(provider)` usage.
   
   * **Python bindings**
   
     * `python/datafusion/catalog.py`
   
       * Add `Table.__datafusion_table_provider__` to expose the underlying 
PyCapsule from the Python `Table` wrapper so it can be treated as a 
TableProvider-exportable object by other Python code.
     * `python/datafusion/context.py`
   
       * Update `SessionContext.read_table` typing and docstring to accept 
either `Table` or a `TableProviderExportable` object (an object implementing 
`__datafusion_table_provider__`).
       * Adjust internal dispatch so both `Table` instances and provider 
objects are supported.
   
   * **Rust core**
   
     * `src/utils.rs`
   
       * Add `foreign_table_provider_from_capsule` and 
`try_table_provider_from_object` helpers to centralize validation and 
extraction of `FFI_TableProvider` from a PyCapsule and to convert it into an 
`Arc<dyn TableProvider>`.
     * `src/catalog.rs`
   
       * Use `try_table_provider_from_object` to detect and accept provider 
objects that expose `__datafusion_table_provider__` when registering tables 
into the catalog.
       * Add `PyTable::__datafusion_table_provider__` so `Table` can export an 
`FFI_TableProvider` PyCapsule (this is what `python/catalog.py` calls through 
the Python layer).
       * Simplify and reorganize provider extraction logic inside 
`register_table` and schema provider lookup to prefer direct `PyTable` 
extraction, then `try_table_provider_from_object`, then fallback to 
constructing a `Dataset` as before.
     * `src/context.rs`
   
       * Update `PySessionContext::register_table` to accept PyCapsule-backed 
provider objects by using `try_table_provider_from_object`.
       * Update `PySessionContext::read_table` to accept a generic `PyAny` 
bound and detect either `PyTable` (native, avoid FFI round-trip) or any object 
that exposes `__datafusion_table_provider__`. Returns an error if neither 
condition is met.
     * `src/udtf.rs`
   
       * Use `try_table_provider_from_object` when calling Python table 
functions so UDTFs that return a provider object via 
`__datafusion_table_provider__` are accepted.
   
   * **Tests**
   
     * `python/tests/test_catalog.py`
   
        * Add `test_register_raw_table_without_capsule` to ensure `RawTable` 
objects can be registered directly (a monkeypatch verifies the capsule path is 
not invoked), queried, and deregistered.
     * `python/tests/test_context.py`
   
       * Add `test_read_table_accepts_table_provider` to verify 
`ctx.read_table(provider)` works when `provider` is a PyCapsule-backed object, 
and that `ctx.read_table(table)` still works for regular `Table` objects.
       * Minor import cleanup (moved `uuid4` import to module-level where 
appropriate).
   
   Other smaller maintenance changes: imports reorganized and some helper 
functions added to centralize PyCapsule validation and conversion.
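
   Taken together, the extraction order described above (native table first, 
then capsule export, then `Dataset` fallback) can be sketched in plain Python. 
All class and helper names below are illustrative stand-ins, not the actual 
Rust implementation in `src/context.rs` or `src/utils.rs`:

   ```python
   from typing import Any


   class NativeTable:
       """Stand-in for the native Python `Table` wrapper (hypothetical)."""

       def __init__(self, inner: Any) -> None:
           self.inner = inner


   class Dataset:
       """Stand-in for a pyarrow-Dataset-like input (hypothetical)."""


   def extract_provider(obj: Any) -> Any:
       """Mirror the dispatch order described above."""
       # 1. Prefer the native table directly, avoiding an FFI round-trip.
       if isinstance(obj, NativeTable):
           return obj.inner
       # 2. Accept any object exporting a provider PyCapsule.
       if hasattr(obj, "__datafusion_table_provider__"):
           return obj.__datafusion_table_provider__()
       # 3. Fall back to wrapping a Dataset, as before this change.
       if isinstance(obj, Dataset):
           return ("dataset", obj)
       raise TypeError("expected a Table, a provider object, or a Dataset")
   ```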
   
   ## Are these changes tested?
   
   Yes. New unit tests were added to validate the new behavior and to guard 
against regressions:
   
   * `test_read_table_accepts_table_provider` (in 
`python/tests/test_context.py`) exercises reading from a registered provider 
and from a provider object directly.
   * `test_register_raw_table_without_capsule` (in 
`python/tests/test_catalog.py`) verifies that the raw table registration path 
does not trigger capsule-based extraction and that queries against the 
registered table return the expected results.
   
   Existing tests were left intact, and the new tests exercise both the 
Python- and Rust-side changes.
   
   ## Are there any user-facing changes?
   
   Yes — API behavior and documentation are updated:
   
   * `SessionContext.read_table` now accepts either a 
`datafusion.catalog.Table` or any object whose 
`__datafusion_table_provider__()` method returns a PyCapsule named 
`datafusion_table_provider`. Users can now call `ctx.read_table(provider)` on 
provider objects without registering them first.
   * New docs in `docs/source/user-guide/io/table_provider.rst` show the 
direct-use pattern via `ctx.read_table(provider)`.
   
   This is backwards-compatible: previously-accepted inputs (the Python `Table` 
wrapper and `Dataset`-like objects) continue to work.
   
   *No breaking changes were made to public function signatures on the Rust 
side; the changes are additive and focus on extending the accepted input types 
and centralizing provider extraction logic.*
   
   ## Notes / Caveats
   
   * The capsule name used and validated is `"datafusion_table_provider"`. 
Provider objects must implement a `__datafusion_table_provider__()` method 
that returns a PyCapsule with that name.
   * A `PyTable` (the native Python `Table` wrapper) still exposes its provider 
via `__datafusion_table_provider__()`; however, the Rust `read_table` path 
prefers direct `PyTable` usage to avoid unnecessary FFI round-trips when the 
object is already a `RawTable`.
   * The `FFI_TableProvider::new(..., Some(runtime))` call means the created 
FFI wrapper captures a Tokio runtime handle, so embedding applications must 
keep a compatible runtime alive while the provider is in use.
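
   As a purely illustrative aside, the capsule-name check can be reproduced 
from Python through CPython's capsule C-API via `ctypes`. The real validation 
lives in the Rust helpers in `src/utils.rs`; the pointer below is a dummy 
value, not a real `FFI_TableProvider`:

   ```python
   import ctypes

   CAPSULE_NAME = b"datafusion_table_provider"

   # Bind the CPython capsule C-API through ctypes (CPython only).
   PyCapsule_New = ctypes.pythonapi.PyCapsule_New
   PyCapsule_New.restype = ctypes.py_object
   PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

   PyCapsule_IsValid = ctypes.pythonapi.PyCapsule_IsValid
   PyCapsule_IsValid.restype = ctypes.c_int
   PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]

   # A dummy non-NULL pointer stands in for the FFI_TableProvider struct.
   capsule = PyCapsule_New(ctypes.c_void_p(1), CAPSULE_NAME, None)

   # Validation succeeds only when the capsule name matches exactly.
   print(PyCapsule_IsValid(capsule, CAPSULE_NAME))   # 1
   print(PyCapsule_IsValid(capsule, b"wrong_name"))  # 0
   ```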
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

