[PR] SessionContext: automatically register Python (Arrow/Pandas/Polars) objects referenced in SQL [datafusion-python]

via GitHub Sun, 21 Sep 2025 08:30:09 -0700


kosiew opened a new pull request, #1247:
URL: https://github.com/apache/datafusion-python/pull/1247


   
   ## Which issue does this PR close?
   
   * Closes #1244
   
   ## Rationale for this change
   
   Users currently must explicitly register any in-memory Arrow/pandas/Polars 
tables before running SQL. This makes quick exploratory workflows (where users 
create a DataFrame in Python and immediately query it with SQL) awkward, because 
the user must first call `from_pandas`, `from_arrow`, or a similar registration 
helper. This change implements an *opt-in*, replacement-scan-style lookup that 
inspects Python scopes for variables whose names match missing table identifiers 
and automatically registers them if they expose Arrow-compatible data.
   
   The behaviour is safe-by-default (disabled), but can be enabled either at 
session construction time or via `SessionConfig`. It improves ergonomics for 
REPL/notebook workflows while preserving existing semantics for applications 
that require explicit registration.
   
   ## What changes are included in this PR?
   
   Summary of functional changes:
   
   * **Python API**
   
     * Added `SessionConfig.with_python_table_lookup(enabled: bool = True)` to 
configure default behaviour.
     * `SessionContext` constructor accepts `auto_register_python_objects: bool 
| None` to opt into automatic lookup at construction time. If omitted, it uses 
the `SessionConfig` setting (default `False`).
     * Added `SessionContext.set_python_table_lookup(enabled: bool = True)` to 
toggle behaviour at runtime.
     * `SessionContext.sql(...)` will, when the feature is enabled, extract 
missing table names from DataFusion errors, look up variables with those names 
in the calling Python stack, and automatically register matching objects: 
DataFusion `DataFrame` views, Polars and pandas DataFrames, Arrow `Table` / 
`RecordBatch` / `RecordBatchReader` objects, and objects exposing Arrow C data 
interfaces. Registration uses the existing `from_pandas`, `from_arrow`, 
`from_polars`, or `register_view` helpers.
     * Implemented a weakref-based bindings cache (`_python_table_bindings`) to 
detect reassignment or garbage collection of Python objects and to refresh or 
deregister the corresponding session tables.
   
   * **Error handling (Rust <-> Python bridge)**
   
     * Enhanced the Rust wrapper so DataFusion errors that indicate missing 
tables have a `missing_table_names` attribute on the Python exception object 
when available. This enables robust detection of which table names caused 
planning failures.
     * Implemented a more robust parser (`collect_missing_table_names`) in Rust 
to extract table names from common message formats, including nested 
`DataFusionError::Context` / `Diagnostic` errors.
   
   * **Documentation**
   
     * Added documentation and examples for automatic variable registration to 
`docs/source/user-guide/dataframe/index.rst` and 
`docs/source/user-guide/sql.rst` demonstrating usage with pandas/pyarrow and 
how to enable the feature.
   
   * **Tests**
   
     * Added unit tests in `python/tests/test_context.py`, including:
   
       * `test_sql_missing_table_without_auto_register`
       * `test_sql_missing_table_exposes_missing_table_names`
       * `test_extract_missing_table_names_from_attribute`
       * `test_sql_auto_register_arrow_table`
       * `test_sql_auto_register_multiple_tables_single_query`
       * `test_sql_auto_register_arrow_outer_scope`
       * `test_sql_auto_register_skips_none_shadowing`
       * `test_sql_auto_register_case_insensitive_lookup`
       * `test_sql_auto_register_pandas_dataframe`
       * `test_sql_auto_register_refreshes_reassigned_dataframe`
       * `test_sql_auto_register_polars_dataframe`
       * `test_sql_from_local_arrow_table`
       * `test_sql_from_local_pandas_dataframe`
       * `test_sql_from_local_polars_dataframe`
       * `test_sql_from_local_unsupported_object`
       * `test_session_config_python_table_lookup_enables_auto_registration`
       * `test_sql_auto_register_arrow`
       * `test_sql_auto_register_disabled`
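
A minimal sketch of the precedence described in the Python API bullets above, assuming that an explicit constructor argument takes priority over the config default. `ConfigSketch` and `ContextSketch` are hypothetical stand-ins written for illustration; they are not the classes touched by this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for SessionConfig; not the real class."""

    python_table_lookup: bool = False  # the feature is disabled by default

    def with_python_table_lookup(self, enabled: bool = True) -> "ConfigSketch":
        self.python_table_lookup = enabled
        return self  # returning self allows builder-style chaining


class ContextSketch:
    """Hypothetical stand-in for SessionContext; not the real class."""

    def __init__(
        self,
        config: Optional[ConfigSketch] = None,
        auto_register_python_objects: Optional[bool] = None,
    ) -> None:
        config = config or ConfigSketch()
        if auto_register_python_objects is None:
            # Argument omitted: fall back to the config-level setting.
            self.lookup_enabled = config.python_table_lookup
        else:
            # Assumption: an explicit True/False is used as given.
            self.lookup_enabled = auto_register_python_objects

    def set_python_table_lookup(self, enabled: bool = True) -> None:
        # Runtime toggle, mirroring the method name listed above.
        self.lookup_enabled = enabled
```

Under these assumptions, a context built with no arguments leaves the lookup off, and the three-state `bool | None` argument is what lets "not specified" be distinguished from an explicit `False`.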
   
   ## Implementation notes / design decisions
   
   * **Opt-in (disabled by default)**: The feature is off unless the user 
passes `auto_register_python_objects=True` to `SessionContext(...)`, calls 
`SessionConfig.with_python_table_lookup(True)` when creating the session config, 
or enables it at runtime with `SessionContext.set_python_table_lookup(True)`.
   
   * **Call-stack introspection**: We walk Python frames (using `inspect`) to 
find variables that match missing table names. Lookup is case-insensitive but 
prefers exact name matches. Variables bound to `None` are skipped, so a `None` 
that shadows a name in an inner scope is never registered in place of a usable 
object.
   
   * **Caching & refresh**: A `weakref` reference to the registered Python 
object and its `id()` are stored so we can detect reassignment or object 
collection and refresh session bindings when needed.
   
   * **Robust missing-table extraction**: Because DataFusion error messages 
vary and the Python bindings may receive nested errors, we attempt to extract 
missing table names from a `missing_table_names` attribute (added by Rust) and 
fall back to regex-based extraction from the error message.
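
The call-stack lookup in the notes above can be sketched with the standard `inspect` module. `find_variable` is a hypothetical helper, not the function added in this PR, and it assumes that exact matches are preferred within each scope and that scopes are searched innermost first.

```python
import inspect
from typing import Any, Optional


def find_variable(name: str, skip_frames: int = 1) -> Optional[Any]:
    """Search the caller's frames, innermost first, for a variable `name`.

    Hypothetical helper: within each scope an exact-case match wins over a
    case-insensitive one, and bindings whose value is None are skipped.
    """
    frame = inspect.currentframe()
    try:
        # Step out of this helper to reach the frame that called it.
        for _ in range(skip_frames):
            if frame is None:
                return None
            frame = frame.f_back
        wanted = name.lower()
        while frame is not None:
            for scope in (frame.f_locals, frame.f_globals):
                exact = scope.get(name)
                if exact is not None:
                    return exact
                for key, value in scope.items():
                    if key.lower() == wanted and value is not None:
                        return value
            frame = frame.f_back  # nothing usable here, try the outer frame
        return None
    finally:
        del frame  # avoid keeping a reference cycle through the frame
```

Deleting the frame reference in `finally` follows the recommendation in the `inspect` documentation for code that holds frame objects.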
   
   ## Are these changes tested?
   
   * Yes. Multiple unit tests were added to `python/tests/test_context.py` to 
exercise both the registration flow and the failure modes. The Rust-side 
changes are exercised indirectly by the Python tests, which assert the presence 
of `missing_table_names` on raised exceptions and the successful registration 
behaviour.
   
   If additional Rust unit tests are desired for the 
`collect_missing_table_names` parsing helper, they can be added (not included 
in this PR).
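
For illustration, the Python-side fallback described in the implementation notes (structured attribute first, regex second) might look roughly like the sketch below. The function name and the two message patterns are assumptions; actual DataFusion error wording may differ between versions.

```python
import re
from typing import List

# Assumed message shapes, used only for this sketch.
_MISSING_TABLE_PATTERNS = (
    re.compile(r"table '([^']+)' not found", re.IGNORECASE),
    re.compile(r"No table named '([^']+)'", re.IGNORECASE),
)


def extract_missing_table_names(error: BaseException) -> List[str]:
    """Prefer the structured attribute, then fall back to the message text."""
    names = getattr(error, "missing_table_names", None)
    if names:
        return list(names)
    found: List[str] = []
    message = str(error)
    for pattern in _MISSING_TABLE_PATTERNS:
        for match in pattern.findall(message):
            # Keep only the last part of a qualified name such as
            # catalog.schema.table, dropping duplicates but keeping order.
            table = match.rsplit(".", 1)[-1]
            if table not in found:
                found.append(table)
    return found
```

A parser of this kind is only as reliable as the patterns it knows about, which is why the attribute set on the Rust side is checked first.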
   
   ## Are there any user-facing changes?
   
   * Yes. When enabled, `SessionContext.sql(...)` automatically registers 
Python objects referenced in SQL. The feature is **opt-in** and **disabled by 
default**.
   
   * New configuration options & methods:
   
     * `SessionConfig.with_python_table_lookup(enabled: bool)`
     * `SessionContext(auto_register_python_objects=...)`
     * `SessionContext.set_python_table_lookup(enabled: bool)`
   
   * Documentation updated with examples demonstrating the feature.
   
   ### Backwards compatibility
   
   No breaking API changes to existing functions. Default behaviour is 
unchanged (feature disabled) so existing applications that rely on explicit 
registration will not be affected.
   
   ## Example usage
   
   ```py
   from datafusion import SessionContext, SessionConfig
   import pandas as pd
   
   # Enable the lookup through the session config
   ctx = SessionContext(config=SessionConfig().with_python_table_lookup(True))
   pdf = pd.DataFrame({"value": [1, 2, 3]})
   # `pdf` is not registered explicitly; it is resolved by variable name
   res = ctx.sql("SELECT SUM(value) AS total FROM pdf").to_pandas()
   
   # Or opt in directly when constructing the context
   ctx2 = SessionContext(auto_register_python_objects=True)
   ```
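
If a variable such as `pdf` is later reassigned or garbage collected, the caching note above says the binding is refreshed using a weak reference plus the object's `id()`. A rough sketch of that check follows; `BindingCache` and its methods are hypothetical names, and the real `_python_table_bindings` structure may be organised differently.

```python
import weakref
from typing import Any, Dict, Optional, Tuple


class BindingCache:
    """Hypothetical cache from table names to the objects they came from."""

    def __init__(self) -> None:
        # table name -> (weak reference to the object, id() at registration)
        self._bindings: Dict[str, Tuple[weakref.ReferenceType, int]] = {}

    def remember(self, table: str, obj: Any) -> None:
        # A weak reference does not keep the registered object alive.
        self._bindings[table] = (weakref.ref(obj), id(obj))

    def needs_refresh(self, table: str, current: Optional[Any]) -> bool:
        """Return True when `table` should be registered again or dropped."""
        entry = self._bindings.get(table)
        if entry is None:
            return True  # never registered
        ref, registered_id = entry
        if ref() is None:
            return True  # the original object was garbage collected
        # The original is still alive, so its id is unique among live objects;
        # a different id means the variable now names another object.
        return id(current) != registered_id
```

Only weak-referenceable objects can be tracked this way; built-in `list` and `dict` instances, for example, cannot be passed to `weakref.ref`.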
   
   

