kosiew opened a new pull request, #1247:
URL: https://github.com/apache/datafusion-python/pull/1247
## Which issue does this PR close?
* Closes #1244
## Rationale for this change
Users must currently register any in-memory Arrow/pandas/Polars table
explicitly before querying it with SQL. This makes quick exploratory workflows
(create a DataFrame in Python, then immediately query it with SQL) awkward,
because the user must first call `from_pandas`, `from_arrow`, or a similar
registration helper.
The change implements an *opt-in* replacement-scan style lookup that inspects
Python scopes to find variables whose names match missing table identifiers and
automatically registers them if they expose Arrow-compatible data.
The behaviour is safe-by-default (disabled), but can be enabled either at
session construction time or via `SessionConfig`. It improves ergonomics for
REPL/notebook workflows while preserving existing semantics for applications
that require explicit registration.
## What changes are included in this PR?
Summary of functional changes:
* **Python API**
  * Added `SessionConfig.with_python_table_lookup(enabled: bool = True)` to
    configure default behaviour.
  * The `SessionContext` constructor accepts
    `auto_register_python_objects: bool | None` to opt into automatic lookup at
    construction time. If omitted, it uses the `SessionConfig` setting
    (default `False`).
  * Added `SessionContext.set_python_table_lookup(enabled: bool = True)` to
    toggle behaviour at runtime.
  * When the feature is enabled, `SessionContext.sql(...)` extracts missing
    table names from DataFusion errors, looks up variables with those names in
    the calling Python stack, and automatically registers matching objects:
    DataFusion `DataFrame` views, Polars DataFrame, pandas DataFrame, Arrow
    `Table` / `RecordBatch` / `RecordBatchReader`, or objects exposing Arrow C
    data interfaces. Registration uses the existing `from_pandas`,
    `from_arrow`, `from_polars`, or `register_view` helpers.
  * Implemented a weakref-based bindings cache (`_python_table_bindings`) to
    detect reassignment or garbage collection of Python objects and
    refresh/deregister session tables appropriately.
* **Error handling (Rust <-> Python bridge)**
  * Enhanced the Rust wrapper so DataFusion errors that indicate missing
    tables have a `missing_table_names` attribute on the Python exception
    object when available. This enables robust detection of which table names
    caused planning failures.
  * Implemented a more robust parser (`collect_missing_table_names`) in Rust
    to extract table names from common message formats, including nested
    `DataFusionError::Context` / `Diagnostic` errors.
* **Documentation**
  * Added documentation and examples for automatic variable registration to
    `docs/source/user-guide/dataframe/index.rst` and
    `docs/source/user-guide/sql.rst` demonstrating usage with pandas/pyarrow
    and how to enable the feature.
* **Tests**
  * Added the following unit tests to `python/tests/test_context.py`:
    * `test_sql_missing_table_without_auto_register`
    * `test_sql_missing_table_exposes_missing_table_names`
    * `test_extract_missing_table_names_from_attribute`
    * `test_sql_auto_register_arrow_table`
    * `test_sql_auto_register_multiple_tables_single_query`
    * `test_sql_auto_register_arrow_outer_scope`
    * `test_sql_auto_register_skips_none_shadowing`
    * `test_sql_auto_register_case_insensitive_lookup`
    * `test_sql_auto_register_pandas_dataframe`
    * `test_sql_auto_register_refreshes_reassigned_dataframe`
    * `test_sql_auto_register_polars_dataframe`
    * `test_sql_from_local_arrow_table`
    * `test_sql_from_local_pandas_dataframe`
    * `test_sql_from_local_polars_dataframe`
    * `test_sql_from_local_unsupported_object`
    * `test_session_config_python_table_lookup_enables_auto_registration`
    * `test_sql_auto_register_arrow`
    * `test_sql_auto_register_disabled`
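
For readers unfamiliar with the pattern, the weakref-based bindings cache listed under **Python API** could look roughly like the sketch below. It is not the code in this PR: `BindingsCache`, `Table`, and the status strings are hypothetical names chosen for illustration, and only the idea of storing a `weakref` plus `id()` per table name comes from the description above.

```python
import weakref


class Table:
    """Hypothetical stand-in for an Arrow-compatible object (weakref-able)."""

    def __init__(self, rows):
        self.rows = rows


class BindingsCache:
    """Hypothetical name -> (weakref, id) cache; not the PR's implementation."""

    def __init__(self):
        self._bindings = {}

    def bind(self, name, obj):
        # A weak reference means the cache never keeps the object alive.
        self._bindings[name] = (weakref.ref(obj), id(obj))

    def status(self, name, current):
        """Classify the binding against the object now bound to `name`."""
        entry = self._bindings.get(name)
        if entry is None:
            return "unbound"
        ref, obj_id = entry
        # Check liveness first: id() values can be reused after collection.
        if ref() is None:
            return "collected"
        if id(current) != obj_id:
            return "reassigned"
        return "current"
```

A caller would re-register the table on `"reassigned"` and deregister it on `"collected"`; how the PR maps these cases onto session operations is not shown here.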
## Implementation notes / design decisions
* **Opt-in, disabled by default**: The feature is off unless the user either
passes `auto_register_python_objects=True` to `SessionContext(...)` or calls
`SessionConfig.with_python_table_lookup(True)` when creating the session config.
* **Call-stack introspection**: We walk Python frames (using `inspect`) to
find variables that match missing table names. Lookup is case-insensitive but
prefers exact name matches. Variables bound to `None` are skipped, so a `None`
binding that shadows a name does not get registered by mistake.
* **Caching & refresh**: A `weakref` reference to the registered Python
object and its `id()` are stored so we can detect reassignment or object
collection and refresh session bindings when needed.
* **Robust missing-table extraction**: Because DataFusion error messages
vary and the Python bindings may receive nested errors, we attempt to extract
missing table names from a `missing_table_names` attribute (added by Rust) and
fall back to regex-based extraction from the error message.
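
The call-stack lookup described above can be illustrated with a minimal standard-library sketch. `find_in_caller_scopes` and its `skip_frames` parameter are hypothetical names, and the search order (locals before globals, innermost frame first) is an assumption rather than a statement of what this PR does.

```python
import inspect


def find_in_caller_scopes(name, skip_frames=1):
    """Return the first non-None variable matching `name` in calling frames.

    Within each scope an exact match wins over a case-insensitive one, and
    bindings whose value is None are skipped so outer scopes are still tried.
    """
    frame = inspect.currentframe()
    try:
        # Step past this helper's own frame.
        for _ in range(skip_frames):
            if frame is None:
                return None
            frame = frame.f_back
        lowered = name.lower()
        while frame is not None:
            for scope in (frame.f_locals, frame.f_globals):
                value = scope.get(name)
                if value is not None:
                    return value
                # Copy the items so the scope cannot change during iteration.
                for key, candidate in list(scope.items()):
                    if key.lower() == lowered and candidate is not None:
                        return candidate
            frame = frame.f_back
        return None
    finally:
        del frame  # avoid a reference cycle through the frame object
```

A real implementation would additionally check that the object found exposes Arrow-compatible data before registering it.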
## Are these changes tested?
* Yes. Unit tests were added to `python/tests/test_context.py` to exercise
both the registration flow and the failure modes. The Rust-side changes are
exercised indirectly via the Python tests, which assert the presence of
`missing_table_names` on raised exceptions and the successful registration
behaviour.
If additional Rust unit tests are desired for the
`collect_missing_table_names` parsing helper, they can be added (not included
in this PR).
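
As a rough Python-side illustration of the attribute-first, regex-fallback extraction that these tests exercise, consider the sketch below. `extract_missing_table_names` is a hypothetical helper, and the message pattern `table '<name>' not found` is an assumed error format that may not match every DataFusion version.

```python
import re

# Assumed message shape; real DataFusion wording may differ between versions.
_MISSING_TABLE_RE = re.compile(r"table '([^']+)' not found", re.IGNORECASE)


def extract_missing_table_names(error):
    """Return missing table names from an exception, attribute first."""
    names = getattr(error, "missing_table_names", None)
    if names:
        return list(names)
    # Fallback: scan the message text and keep the unqualified table name,
    # preserving first-seen order and dropping duplicates.
    found = _MISSING_TABLE_RE.findall(str(error))
    return list(dict.fromkeys(name.rsplit(".", 1)[-1] for name in found))
```

Preferring the structured attribute keeps the lookup independent of message wording wherever the Rust wrapper supplies it.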
## Are there any user-facing changes?
* Yes. New optional behaviour that automatically registers Python objects
referenced in SQL when enabled. This is an **opt-in** feature and is **disabled
by default**.
* New configuration options & methods:
  * `SessionConfig.with_python_table_lookup(enabled: bool = True)`
  * `SessionContext(auto_register_python_objects=...)`
  * `SessionContext.set_python_table_lookup(enabled: bool = True)`
* Documentation updated with examples demonstrating the feature.
### Backwards compatibility
No breaking API changes to existing functions. Default behaviour is
unchanged (feature disabled) so existing applications that rely on explicit
registration will not be affected.
## Example usage
```py
import pandas as pd

from datafusion import SessionConfig, SessionContext

# Enable the lookup through the session config
ctx = SessionContext(config=SessionConfig().with_python_table_lookup(True))

pdf = pd.DataFrame({"value": [1, 2, 3]})
# `pdf` is never registered explicitly; it is resolved from the calling scope
res = ctx.sql("SELECT SUM(value) AS total FROM pdf").to_pandas()

# Or enable it directly in the SessionContext constructor
ctx2 = SessionContext(auto_register_python_objects=True)
```