LantaoJin opened a new pull request, #72:
URL: https://github.com/apache/datafusion-java/pull/72

   ## Which issue does this PR close?
   
   - Closes #44.
   
   ## Rationale for this change
   
   Issue **#44** is the design issue for `DataFrame.join` / `joinOn` plus the 
open question of how to represent DataFusion `Expr` values in Java. The issue 
offers three Phase-2 paths and recommends option (1) — SQL strings parsed via 
`parse_sql_expr` — for consistency with the existing `filter(String)` and 
`withColumn(String, String)` patterns. Option (2), a typed Java `Expr` builder 
mirroring DataFusion's class hierarchy, is explicitly flagged in the issue as 
carrying significant ongoing maintenance burden.
   
   This PR ships **Phase 1**: column-name `join` plus `joinOn` with String-form 
predicates. By committing to SQL-string predicates here, it also closes the 
Phase-2 design question per Andy's recommendation. A typed `Expr` builder is 
deliberately deferred.
   
   Joins are the largest missing piece of the DataFrame API today; without 
them, Java callers cannot programmatically express even simple star-join 
queries that DataFusion's Rust API has supported for years.
   
   ## What changes are included in this PR?
   
   New public Java enum `org.apache.datafusion.JoinType` mirroring upstream's 
10-variant enum (`INNER`, `LEFT`, `RIGHT`, `FULL`, `LEFT_SEMI`, `RIGHT_SEMI`, 
`LEFT_ANTI`, `RIGHT_ANTI`, `LEFT_MARK`, `RIGHT_MARK`). Crosses JNI as a `byte` 
code, mirroring the existing `Volatility` precedent for UDFs.
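As a rough illustration of how an enum can cross JNI as a `byte` code, here is a minimal sketch. The name `JoinKind`, the accessor names, and the specific code values 0 through 9 are assumptions made for this example; the actual `JoinType` in this PR may assign codes differently.

```java
// Hypothetical sketch: names and code values are illustrative, not taken from the PR.
enum JoinKind {
    INNER(0), LEFT(1), RIGHT(2), FULL(3), LEFT_SEMI(4), RIGHT_SEMI(5),
    LEFT_ANTI(6), RIGHT_ANTI(7), LEFT_MARK(8), RIGHT_MARK(9);

    private final byte code;

    JoinKind(int code) {
        this.code = (byte) code;
    }

    /** Byte value that would be handed to the native side. */
    byte code() {
        return code;
    }

    /** Inverse mapping; rejects unknown codes instead of guessing a default. */
    static JoinKind fromCode(byte code) {
        for (JoinKind kind : values()) {
            if (kind.code == code) {
                return kind;
            }
        }
        throw new IllegalArgumentException("unknown join type code: " + code);
    }
}
```

Explicit codes, rather than `ordinal()`, keep the wire value stable if variants are ever reordered. The native decoder would presumably reject unknown bytes in the same way.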
   
   Three new methods on `DataFrame`:
   
   - `DataFrame.join(right, type, leftCols, rightCols)` — equi-join on named 
columns, no residual filter.
   - `DataFrame.join(right, type, leftCols, rightCols, filter)` — equi-join 
with a residual SQL filter parsed against the *combined* schema of left + right.
   - `DataFrame.joinOn(right, type, predicates...)` — arbitrary join 
predicates, each parsed as a SQL expression against the combined schema. 
Predicates may be qualified with the relation alias when ambiguous (e.g. 
`"l.id = r.id"`, `"left.x < right.y"`).
   
   All three are non-consuming on the Java side: both `left` (the receiver) and 
`right` remain usable and must still be closed independently. Upstream's Rust 
`join` / `join_on` consume both, but the JNI layer clones the underlying 
`Arc`-backed `DataFrame` like every other transformation method (`select` / 
`filter` / `withColumn` / `unnestColumns`). This matches the established 
Java-side convention and supports the natural star-join pattern where the same 
fact table is joined to multiple dimensions:
   
   ```java
   DataFrame fact = ctx.sql(...);
   DataFrame d1   = fact.join(dimA, INNER, fact_keys, dimA_keys);
   // fact is still usable after the first join
   DataFrame d2   = fact.join(dimB, INNER, fact_keys, dimB_keys);
   ```
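The non-consuming contract can be modelled without the native layer. The sketch below uses a hypothetical `ToyFrame` class (not part of this PR or of DataFusion) that tracks only a plan string and a closed flag, to show what "both inputs remain usable and must be closed independently" means in practice.

```java
// Toy model only: ToyFrame is hypothetical and does not touch the real native layer.
final class ToyFrame implements AutoCloseable {
    private final String plan;
    private boolean closed;

    ToyFrame(String plan) {
        this.plan = plan;
    }

    /** Returns a new frame; neither this frame nor right is closed or invalidated. */
    ToyFrame join(ToyFrame right, String type) {
        ensureOpen();
        right.ensureOpen();
        return new ToyFrame("(" + plan + " " + type + " JOIN " + right.plan + ")");
    }

    String plan() {
        ensureOpen();
        return plan;
    }

    boolean isClosed() {
        return closed;
    }

    private void ensureOpen() {
        if (closed) {
            throw new IllegalStateException("frame already closed");
        }
    }

    @Override
    public void close() {
        closed = true;
    }
}
```

In this model, closing a join result leaves its inputs open, so each frame, including every intermediate result, needs its own `close()` or its own slot in a try-with-resources statement.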
   
   Native side (`native/src/lib.rs`):
   
   - `Java_org_apache_datafusion_DataFrame_joinDataFrame` — column-name 
equi-join, with optional residual filter parsed via 
`state.create_logical_expr(sql, &combined_schema)`.
   - `Java_org_apache_datafusion_DataFrame_joinOnDataFrame` — predicate-based 
join; each predicate parsed against the combined schema.
   - `join_type_from_byte` helper, mirroring `volatility_from_byte`.
   - `collect_jstring_array` helper extracted from the existing repeated 
pattern.
   
   Predicate parsing reaches the `SessionState` via `df.clone().into_parts()` — 
`DataFrame::session_state` is private but `into_parts()` is public, and the 
clone keeps the original DataFrame intact. The combined schema is 
`left.schema().join(right.schema())?` — `DFSchema::join` errors on duplicate 
column names, the same constraint upstream's join enforces. Callers with name 
collisions must alias the colliding columns before joining.
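To make the duplicate-name constraint concrete, here is a simplified model of combining two schemas. It assumes schemas are plain lists of unqualified column names; `SchemaSketch.combine` is a hypothetical helper, and the real `DFSchema` also tracks relation qualifiers, which this sketch ignores.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model: plain column names only; relation qualifiers are ignored.
final class SchemaSketch {
    private SchemaSketch() {}

    /** Concatenates left and right column names, failing on the first duplicate. */
    static List<String> combine(List<String> left, List<String> right) {
        List<String> combined = new ArrayList<>(left);
        combined.addAll(right);
        Set<String> seen = new HashSet<>();
        for (String name : combined) {
            if (!seen.add(name)) {
                throw new IllegalArgumentException("duplicate column name: " + name);
            }
        }
        return combined;
    }
}
```

Under this model, joining `[id, x]` with `[id, y]` fails, while renaming the right-hand key first (for example to `rid`) succeeds. That is the aliasing step callers are expected to perform.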
   
   No new proto, no new Cargo dependencies, no new module, no Java public-API 
changes outside `DataFrame` and the new `JoinType` enum.
   
   Out of scope (for follow-ups):
   
   - **Typed `Expr` builder** (Phase 2 option 2). Andy's issue explicitly 
cautions against the maintenance burden. The String-form predicate channel 
covers everything DataFusion's parser supports.
   - **Cross-join, natural-join.** Not in #44; separate issues.
   - **Predicate validation pre-flight on the Java side.** DataFusion's parser 
is the source of truth and surfaces errors the same way `filter(String)` 
already does.
   
   ## Are these changes tested?
   
   Yes, 18 new tests in 
`core/src/test/java/org/apache/datafusion/DataFrameJoinTest.java`.
   
   ## Are there any user-facing changes?
   
   Yes, purely additive. New public API:
   
   - `org.apache.datafusion.JoinType` (enum)
   - `DataFrame.join(DataFrame, JoinType, String[], String[])`
   - `DataFrame.join(DataFrame, JoinType, String[], String[], String)`
   - `DataFrame.joinOn(DataFrame, JoinType, String...)`
   
   No API removals, no deprecations, no behavior change for existing callers. 
No new Cargo features or dependencies are added, so the native binary size 
should be essentially unchanged.
   

