LantaoJin opened a new pull request, #71: URL: https://github.com/apache/datafusion-java/pull/71
## Which issue does this PR close?

- Closes #45.

## Rationale for this change

`DataFrame` introspection in the Java binding is limited to `count()`, `show()`, and `show(int)`. Java callers cannot programmatically retrieve a DataFrame's Arrow schema, inspect its query plan, materialise an intermediate result, or compute summary statistics, all capabilities that DataFusion's Rust API has surfaced for years (`df.schema()`, `df.explain(verbose, analyze)`, `df.cache().await`, `df.describe().await`).

The issue groups these four as a coherent introspection surface; their JNI plumbing is structurally identical (one `df.clone()` plus one upstream call), so shipping them together is the smallest reasonable PR.

`schema()` and `explain()` are non-consuming and cheap: they only inspect the logical plan. `cache()` and `describe()` execute the plan eagerly, but callers incur that cost only by choosing to invoke them; their Java signatures have the same shape as the cheap pair.

## What changes are included in this PR?

- `DataFrame.schema() → Schema`: returns the Arrow schema via the existing IPC byte channel (mirrors `SessionContext.tableSchema`). No `BufferAllocator` is required because schemas carry no buffer data.
- `DataFrame.explain(boolean verbose, boolean analyze) → DataFrame`: wraps upstream's two-flag `df.explain`. The result is a lazy DataFrame whose batches contain the plan-type label and plan text; render it via `show()` or `collect()`.
- `DataFrame.cache() → DataFrame`: wraps `df.cache().await`. Materialises the plan into an in-memory table and returns a new DataFrame that scans it.
- `DataFrame.describe() → DataFrame`: wraps `df.describe().await`. DataFusion runs seven aggregate sub-plans (`count`, `null_count`, `mean`, `std`, `min`, `max`, `median`) to build the summary table.
- All four are non-consuming on the Java side: the receiver remains usable and must still be closed independently.
  This matches the convention already used by `select` / `filter` / `withColumn` / `unnestColumns`. Upstream takes `self` for `explain` / `cache` / `describe`, but the JNI layer clones the underlying `Arc`-backed `DataFrame` like every other transformation method.
- All four guard against use after `close()` with the same `IllegalStateException` message used by every other DataFrame method.

Native side (`native/src/lib.rs`):

- `Java_org_apache_datafusion_DataFrame_schemaIpc`: non-consuming, returns Arrow IPC bytes via `StreamWriter`.
- `Java_org_apache_datafusion_DataFrame_explainPlan`: non-consuming, runs `df.clone().explain(verbose, analyze)?` (sync upstream).
- `Java_org_apache_datafusion_DataFrame_cachePlan`: non-consuming, runs `runtime().block_on(df.clone().cache())`.
- `Java_org_apache_datafusion_DataFrame_describePlan`: non-consuming, runs `runtime().block_on(df.clone().describe())`.

All four use the standard `try_unwrap_or_throw(...)` envelope so panics and `Err` values surface as Java exceptions.

Out of scope (for follow-ups):

- `explain_with_options(ExplainOption)`: the typed-options variant exposes `ExplainFormat::Json` / `Tree` and per-context filters. It is easy to add later as `explain(ExplainOptions)` once someone files a follow-up; the issue asks only for the two-boolean form.
- `DataFrame.logicalPlan()` / `optimizedPlan()` / `physicalPlan()`: Java mirrors of upstream's plan getters. Not in scope for #45.
- Cache-factory wiring: upstream's `SessionState::cache_factory()` lets users plug in alternative cache backends. This PR always uses the default `MemTable` path; pluggable cache factories belong on a separate config surface.
- Column-filtered `describe()`: upstream has no API for this. Callers who need it can use `df.select(...).describe()`.

## Are these changes tested?

Yes, 13 new tests in `core/src/test/java/org/apache/datafusion/DataFrameIntrospectionTest.java`.

## Are there any user-facing changes?

Yes, purely additive.
New public API:

- `DataFrame.schema()`
- `DataFrame.explain(boolean, boolean)`
- `DataFrame.cache()`
- `DataFrame.describe()`

No API removals, no deprecations, no behavior change for existing callers. The native binary is unchanged in size (no new Cargo features or dependencies).
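
For reviewers, the non-consuming and use-after-close conventions described above can be sketched in plain Java. This is a minimal illustration, not the binding's code: `SketchFrame`, its `plan` string, and the exception message are hypothetical stand-ins for the native-backed `DataFrame` handle, and no JNI is involved.

```java
// Hypothetical stand-in for a native-backed DataFrame handle.
// NOT the real org.apache.datafusion.DataFrame class.
final class SketchFrame implements AutoCloseable {
    private final String plan;   // stands in for the native plan pointer
    private boolean closed;

    SketchFrame(String plan) {
        this.plan = plan;
    }

    // Non-consuming transformation: the receiver stays open, and the result
    // is a separate handle that must be closed on its own.
    SketchFrame cache() {
        ensureOpen();
        return new SketchFrame("MemTableScan(" + plan + ")");
    }

    String plan() {
        ensureOpen();
        return plan;
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        closed = true;
    }

    // Shared guard: every method rejects use after close() the same way.
    private void ensureOpen() {
        if (closed) {
            // The message text is an assumption made for this sketch.
            throw new IllegalStateException("DataFrame is closed");
        }
    }

    public static void main(String[] args) {
        // try-with-resources closes both handles independently.
        try (SketchFrame df = new SketchFrame("Scan(t)");
             SketchFrame cached = df.cache()) {
            System.out.println(df.plan());     // receiver is still usable
            System.out.println(cached.plan());
        }
    }
}
```

Under these assumptions, closing the receiver leaves the derived handle open, and any later call on the closed receiver throws `IllegalStateException`, which is the behaviour the new methods are described as sharing with `select` and `filter`.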
