[PR] feat(dataframe): add schema, explain, cache, and describe introspection methods [datafusion-java]

via GitHub Wed, 20 May 2026 00:12:19 -0700


LantaoJin opened a new pull request, #71:
URL: https://github.com/apache/datafusion-java/pull/71


   ## Which issue does this PR close?
   
   - Closes #45.
   
   ## Rationale for this change
   
   `DataFrame` introspection in the Java binding is limited to `count()`, 
`show()`, and `show(int)`. Java callers cannot programmatically retrieve a 
DataFrame's Arrow schema, inspect its query plan, materialise an intermediate 
result, or compute summary statistics — capabilities that DataFusion's Rust API 
has surfaced for years (`df.schema()`, `df.explain(verbose, analyze)`, 
`df.cache().await`, `df.describe().await`). The issue groups these four as a 
coherent introspection surface; their JNI plumbing is structurally identical 
(one `df.clone()` plus one upstream call), so shipping them together is the 
smallest reasonable PR.
   
   `schema()` and `explain()` are non-consuming and cheap: they only inspect 
the logical plan. `cache()` and `describe()` execute the plan eagerly, so their 
cost is incurred only when the caller chooses to invoke them; their Java 
signatures have the same shape as those of the cheap pair.
   
   ## What changes are included in this PR?
   
   - `DataFrame.schema() → Schema` — returns the Arrow schema via the existing 
IPC byte channel (mirrors `SessionContext.tableSchema`). No `BufferAllocator` 
required because schemas carry no buffer data.
   - `DataFrame.explain(boolean verbose, boolean analyze) → DataFrame` — wraps 
upstream's two-flag `df.explain`. The result is a lazy DataFrame whose batches 
contain the plan-type label and plan text; render via `show()` or `collect()`.
   - `DataFrame.cache() → DataFrame` — wraps `df.cache().await`. Materialises 
the plan into an in-memory table and returns a new DataFrame that scans it.
   - `DataFrame.describe() → DataFrame` — wraps `df.describe().await`. 
DataFusion runs seven aggregate sub-plans (`count`, `null_count`, `mean`, 
`std`, `min`, `max`, `median`) to build the summary table.
   - All four are non-consuming on the Java side: the receiver remains usable 
and must still be closed independently. This matches the convention already 
used by `select` / `filter` / `withColumn` / `unnestColumns`. Upstream takes 
`self` for `explain` / `cache` / `describe`, but the JNI layer clones the 
underlying `Arc`-backed `DataFrame` like every other transformation method.
   - All four guard against use after `close()` with the same 
`IllegalStateException` message used by every other DataFrame method.
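   
   The ownership contract described in the last two bullets can be sketched 
in plain Java. This is a minimal model, not the binding's code: `SketchFrame` 
is a hypothetical stand-in for `DataFrame`, its in-memory row list replaces the 
native handle, and the exception message text is an assumption.

```java
import java.util.List;

public class OwnershipSketch {
    /** Hypothetical stand-in for the binding's DataFrame; not the real class. */
    static final class SketchFrame implements AutoCloseable {
        private final List<String> rows;
        private boolean closed;

        SketchFrame(List<String> rows) {
            this.rows = List.copyOf(rows);
        }

        /** Guard shared by every method; the message text is an assumption. */
        private void ensureOpen() {
            if (closed) {
                throw new IllegalStateException("DataFrame is closed");
            }
        }

        /** Non-consuming: returns a new frame and leaves the receiver usable. */
        SketchFrame cache() {
            ensureOpen();
            return new SketchFrame(rows);
        }

        long count() {
            ensureOpen();
            return rows.size();
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    public static void main(String[] args) {
        // Receiver and result are separate resources, so both are closed.
        try (SketchFrame df = new SketchFrame(List.of("a", "b", "c"));
             SketchFrame cached = df.cache()) {
            System.out.println(df.count());     // receiver is still usable
            System.out.println(cached.count()); // result is independent
        }
    }
}
```

   Under this model, a caller would typically hold the receiver and each 
derived frame in the same try-with-resources statement, since closing one does 
not close the other.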
   
   Native side (`native/src/lib.rs`):
   
   - `Java_org_apache_datafusion_DataFrame_schemaIpc` — non-consuming, returns 
Arrow IPC bytes via `StreamWriter`.
   - `Java_org_apache_datafusion_DataFrame_explainPlan` — non-consuming, runs 
`df.clone().explain(verbose, analyze)?` (sync upstream).
   - `Java_org_apache_datafusion_DataFrame_cachePlan` — non-consuming, runs 
`runtime().block_on(df.clone().cache())`.
   - `Java_org_apache_datafusion_DataFrame_describePlan` — non-consuming, runs 
`runtime().block_on(df.clone().describe())`.
   
   All four use the standard `try_unwrap_or_throw(...)` envelope so panics and 
`Err` surface as Java exceptions.
   
   Out of scope (for follow-ups):
   
   - `explain_with_options(ExplainOption)`: the typed-options variant exposes 
`ExplainFormat::Json` / `Tree` and per-context filters. It is easy to add later 
as `explain(ExplainOptions)` if someone files a follow-up; the issue asks only 
for the two-boolean form.
   - `DataFrame.logicalPlan()` / `optimizedPlan()` / `physicalPlan()` — Java 
mirrors of upstream's plan getters. Not in scope for #45.
   - Cache-factory wiring — upstream's `SessionState::cache_factory()` lets 
users plug in alternative cache backends. This PR always uses the default 
`MemTable` path; pluggable cache factories belong on a separate config surface.
   - Column-filtered `describe()` — upstream has no API for this. Callers who 
need it can `df.select(...).describe()`.
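   
   For readers unfamiliar with the summary that `describe()` builds, the seven 
statistics named earlier can be approximated for a single nullable numeric 
column in plain Java. This is an illustrative sketch only: it assumes `count` 
covers non-null values, `std` is the sample standard deviation, and `median` 
averages the two middle values for an even count. DataFusion's exact 
definitions may differ, and `DescribeSketch` is a hypothetical name.

```java
import java.util.Arrays;
import java.util.Objects;

public class DescribeSketch {
    /** Returns {count, null_count, mean, std, min, max, median} for one column. */
    static double[] describe(Double[] column) {
        // Drop nulls and sort once so min, max, and median are index lookups.
        double[] v = Arrays.stream(column)
                .filter(Objects::nonNull)
                .mapToDouble(Double::doubleValue)
                .sorted()
                .toArray();
        int n = v.length;
        double nulls = column.length - n;
        if (n == 0) {
            double nan = Double.NaN;
            return new double[] {0, nulls, nan, nan, nan, nan, nan};
        }
        double mean = Arrays.stream(v).average().orElse(Double.NaN);
        double sumSq = 0.0;
        for (double x : v) {
            sumSq += (x - mean) * (x - mean);
        }
        // Sample standard deviation (n - 1 denominator); an assumption here.
        double std = n > 1 ? Math.sqrt(sumSq / (n - 1)) : Double.NaN;
        double median = (n % 2 == 1) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0;
        return new double[] {n, nulls, mean, std, v[0], v[n - 1], median};
    }

    public static void main(String[] args) {
        Double[] column = {1.0, null, 3.0, 2.0, 4.0};
        System.out.println(Arrays.toString(describe(column)));
    }
}
```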
   ## Are these changes tested?
   
   Yes, 13 new tests in 
`core/src/test/java/org/apache/datafusion/DataFrameIntrospectionTest.java`.
   
   ## Are there any user-facing changes?
   
   Yes, purely additive. New public API:
   
   - `DataFrame.schema()`
   - `DataFrame.explain(boolean, boolean)`
   - `DataFrame.cache()`
   - `DataFrame.describe()`
   
   No API removals, no deprecations, no behavior change for existing callers. 
The native binary is unchanged in size (no new Cargo features or dependencies).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

