LantaoJin opened a new pull request, #72: URL: https://github.com/apache/datafusion-java/pull/72
## Which issue does this PR close?

- Closes #44.

## Rationale for this change

Issue **#44** is the design issue for `DataFrame.join` / `joinOn`, plus the open question of how to represent DataFusion `Expr` values in Java. The issue offers three Phase-2 paths and recommends option (1), SQL strings parsed via `parse_sql_expr`, for consistency with the existing `filter(String)` and `withColumn(String, String)` patterns. Option (2), a typed Java `Expr` builder mirroring DataFusion's class hierarchy, is explicitly flagged in the issue as carrying significant ongoing maintenance burden.

This PR ships **Phase 1**: column-name `join` plus `joinOn` with String-form predicates. By committing to SQL-string predicates here, it also closes the Phase-2 design question per Andy's recommendation. A typed `Expr` builder is deliberately deferred.

Joins are the largest missing piece of the DataFrame API today; without them, Java callers cannot programmatically express even simple star-join queries that DataFusion's Rust API has supported for years.

## What changes are included in this PR?

New public Java enum `org.apache.datafusion.JoinType` mirroring upstream's 10-variant enum (`INNER`, `LEFT`, `RIGHT`, `FULL`, `LEFT_SEMI`, `RIGHT_SEMI`, `LEFT_ANTI`, `RIGHT_ANTI`, `LEFT_MARK`, `RIGHT_MARK`). It crosses JNI as a `byte` code, mirroring the existing `Volatility` precedent for UDFs.

Three new methods on `DataFrame`:

- `DataFrame.join(right, type, leftCols, rightCols)`: equi-join on named columns, no residual filter.
- `DataFrame.join(right, type, leftCols, rightCols, filter)`: equi-join with a residual SQL filter parsed against the *combined* schema of left + right.
- `DataFrame.joinOn(right, type, predicates...)`: arbitrary join predicates, each parsed as a SQL expression against the combined schema. Predicates may be qualified with the relation alias when ambiguous (e.g. `"l.id = r.id"`, `"left.x < right.y"`).
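For readers unfamiliar with the enum-as-`byte` pattern mentioned for `JoinType`, here is a minimal, self-contained sketch of how such a mapping could look on the Java side. It is not the code in this PR: `SketchJoinType`, `code()`, `fromCode()`, and the numeric values 0 through 9 are all assumptions made for illustration, and the real byte assignments may differ.

```java
/** Hypothetical sketch of an enum that crosses a native boundary as a byte. */
final class JoinTypeByteSketch {

    /** Stand-in for org.apache.datafusion.JoinType; the codes are illustrative only. */
    enum SketchJoinType {
        INNER(0), LEFT(1), RIGHT(2), FULL(3),
        LEFT_SEMI(4), RIGHT_SEMI(5), LEFT_ANTI(6), RIGHT_ANTI(7),
        LEFT_MARK(8), RIGHT_MARK(9);

        private final byte code;

        SketchJoinType(int code) {
            this.code = (byte) code;
        }

        /** The byte that would be handed to the native method. */
        byte code() {
            return code;
        }

        /** Inverse mapping; rejects codes that no variant uses. */
        static SketchJoinType fromCode(byte code) {
            for (SketchJoinType t : values()) {
                if (t.code == code) {
                    return t;
                }
            }
            throw new IllegalArgumentException("unknown join type code: " + code);
        }
    }

    public static void main(String[] args) {
        // Round-trip every variant through its byte code.
        for (SketchJoinType t : SketchJoinType.values()) {
            byte wire = t.code();
            System.out.println(t + " -> " + wire + " -> " + SketchJoinType.fromCode(wire));
        }
    }
}
```

Keeping the code in an explicit field, rather than relying on `ordinal()`, means reordering the Java constants cannot silently change what the native side receives.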
All three methods are non-consuming on the Java side: both `left` (the receiver) and `right` remain usable and must still be closed independently. Upstream's Rust `join` / `join_on` consume both, but the JNI layer clones the underlying `Arc`-backed `DataFrame` like every other transformation method (`select` / `filter` / `withColumn` / `unnestColumns`). This matches the established Java-side convention and supports the natural star-join pattern where the same fact table is joined to multiple dimensions:

```java
DataFrame fact = ctx.sql(...);
DataFrame d1 = fact.join(dimA, INNER, fact_keys, dimA_keys);
DataFrame d2 = fact.join(dimB, INNER, fact_keys, dimB_keys);
// fact still usable
```

Native side (`native/src/lib.rs`):

- `Java_org_apache_datafusion_DataFrame_joinDataFrame`: column-name equi-join, with optional residual filter parsed via `state.create_logical_expr(sql, &combined_schema)`.
- `Java_org_apache_datafusion_DataFrame_joinOnDataFrame`: predicate-based join; each predicate parsed against the combined schema.
- `join_type_from_byte` helper, mirroring `volatility_from_byte`.
- `collect_jstring_array` helper extracted from the existing repeated pattern.

Predicate parsing reaches the `SessionState` via `df.clone().into_parts()`: `DataFrame::session_state` is private but `into_parts()` is public, and the clone keeps the original DataFrame intact. The combined schema is `left.schema().join(right.schema())?`; `DFSchema::join` errors on duplicate column names, the same constraint upstream's join enforces. Callers with name collisions must alias the colliding columns before joining.

No new proto, no new Cargo dependencies, no new module, no Java public-API changes outside `DataFrame` and the new `JoinType` enum.

Out of scope (for follow-ups):

- **Typed `Expr` builder** (Phase 2 option 2). Andy's issue explicitly cautions against the maintenance burden. The String-form predicate channel covers everything DataFusion's parser supports.
- **Cross-join, natural-join.** Not in #44; separate issues.
- **Predicate validation pre-flight on the Java side.** DataFusion's parser is the source of truth and surfaces errors the same way `filter(String)` already does.

## Are these changes tested?

Yes, 18 new tests in `core/src/test/java/org/apache/datafusion/DataFrameJoinTest.java`.

## Are there any user-facing changes?

Yes, purely additive. New public API:

- `org.apache.datafusion.JoinType` (enum)
- `DataFrame.join(DataFrame, JoinType, String[], String[])`
- `DataFrame.join(DataFrame, JoinType, String[], String[], String)`
- `DataFrame.joinOn(DataFrame, JoinType, String...)`

No API removals, no deprecations, no behavior change for existing callers. The native binary is unchanged in size (no new Cargo features or dependencies).
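The non-consuming contract described in this PR (a join returns a new handle, while both inputs stay usable and are closed on their own) can be modeled without the native library. The sketch below is a toy under stated assumptions: `Frame` is a hypothetical stand-in, not the project's `DataFrame`; it tracks only a plan string and a closed flag; and it assumes handles are `AutoCloseable` so they can sit in try-with-resources.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of non-consuming joins with independently closed handles. */
final class NonConsumingJoinSketch {

    /** Hypothetical stand-in for a native-backed DataFrame handle. */
    static final class Frame implements AutoCloseable {
        private final String plan;
        private boolean closed;

        Frame(String plan) {
            this.plan = plan;
        }

        /** Returns a NEW handle; neither the receiver nor {@code right} is closed. */
        Frame join(Frame right, String type) {
            ensureOpen();
            right.ensureOpen();
            return new Frame("(" + plan + " " + type + " JOIN " + right.plan + ")");
        }

        String plan() {
            ensureOpen();
            return plan;
        }

        boolean isClosed() {
            return closed;
        }

        private void ensureOpen() {
            if (closed) {
                throw new IllegalStateException("handle already closed: " + plan);
            }
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    /** Star-join pattern: one fact handle joined to two dimensions. */
    static List<String> starJoin() {
        List<String> plans = new ArrayList<>();
        // Every handle, inputs and results alike, gets its own close().
        try (Frame fact = new Frame("fact");
             Frame dimA = new Frame("dimA");
             Frame dimB = new Frame("dimB");
             Frame d1 = fact.join(dimA, "INNER");
             Frame d2 = fact.join(dimB, "INNER")) { // fact is still usable here
            plans.add(d1.plan());
            plans.add(d2.plan());
        }
        return plans;
    }

    public static void main(String[] args) {
        starJoin().forEach(System.out::println);
    }
}
```

With the real API, the same shape would presumably apply: each `DataFrame` returned by `join` or `joinOn` is one more resource to close, in addition to the inputs.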
