LantaoJin opened a new pull request, #66:
URL: https://github.com/apache/datafusion-java/pull/66

   Add DataFrame.sort(SortExpr...), DataFrame.repartitionRoundRobin(int), and 
DataFrame.repartitionHash(int, String...). SortExpr is a small value class with 
static asc/desc factories and a fluent nullsFirst setter, mirroring 
DataFusion's expr::Sort.
   
   The SQL-string sort flavour the issue lists as option 1 is deferred: 
DataFusion 53.1 has no parse_sort_exprs helper on DataFrame, so the string 
flavour would force hand-rolled ORDER BY parsing. The typed SortExpr API is the 
same shape the issue authorises in option 2.
   
   repartitionHash takes column-name keys for v1 and translates each through 
col(...) in the native handler. Expression keys are deferred until a Java-side 
Expr builder lands.
   
   ## Which issue does this PR close?
   
   - Closes #42.
   
   ## Rationale for this change
   
   Two ordering / layout primitives have been missing from the Java `DataFrame` 
API: `sort` (no way to order without dropping to SQL) and `repartition` (no way 
to control parallelism / partitioning). Both are first-class on the upstream 
Rust `DataFrame`, in the default feature set, with no Cargo flag impact. This 
PR exposes them additively.
   
   ## What changes are included in this PR?
   
   - `SortExpr` -- new value class. Final class with static factories 
`SortExpr.asc(String)` / `SortExpr.desc(String)` and a fluent 
`nullsFirst(boolean)` setter. Mirrors DataFusion's 
`expr::Sort { expr, asc, nulls_first }`. Defaults match upstream: ASC → NULLs 
last, DESC → NULLs first.
   - `DataFrame.sort(SortExpr...)` -- ordering. Empty array is a no-op (matches 
`DataFrame::sort(vec![])`); each `SortExpr` is null-checked Java-side; the 
receiver remains usable.
   - `DataFrame.repartitionRoundRobin(int)` -- maps to 
`Partitioning::RoundRobinBatch(usize)`. Java validates `numPartitions > 0`.
   - `DataFrame.repartitionHash(int, String...)` -- maps to 
`Partitioning::Hash(Vec<Expr>, usize)`. Column-name keys for v1; the native 
handler translates each name through `datafusion::logical_expr::col(...)`. Java 
validates `numPartitions > 0`, columns non-null/non-empty, no null elements.
   - `native/src/lib.rs` -- three JNI handlers (`sortRows`, 
`repartitionRoundRobinRows`, `repartitionHashRows`) using the existing 
`try_unwrap_or_throw` plumbing. Boolean arrays are decoded via `JBooleanArray` 
+ `get_boolean_array_region` (jni 0.21).
   - Imports added: `datafusion::logical_expr::{col, Partitioning, SortExpr}`, 
`jni::objects::JBooleanArray`.
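
To make the Java-side contract above concrete, here is a minimal, self-contained sketch. It is not the PR's source: the class names `SortExprSketch` and `RepartitionArgsSketch`, the accessor names, the choice to return a new instance from `nullsFirst(boolean)`, and the exception types are all assumptions for illustration. Only the factory names, the default NULLs ordering, and the validation rules come from the description.

```java
import java.util.Objects;

// Hypothetical stand-in for the SortExpr value class described in the list above.
final class SortExprSketch {
    private final String column;
    private final boolean ascending;
    private final boolean nullsFirst;

    private SortExprSketch(String column, boolean ascending, boolean nullsFirst) {
        this.column = Objects.requireNonNull(column, "column");
        this.ascending = ascending;
        this.nullsFirst = nullsFirst;
    }

    // ASC defaults to NULLs last.
    static SortExprSketch asc(String column) {
        return new SortExprSketch(column, true, false);
    }

    // DESC defaults to NULLs first.
    static SortExprSketch desc(String column) {
        return new SortExprSketch(column, false, true);
    }

    // Fluent override of the NULLs ordering (assumption: returns a new immutable instance).
    SortExprSketch nullsFirst(boolean value) {
        return new SortExprSketch(column, ascending, value);
    }

    String column() { return column; }
    boolean isAscending() { return ascending; }
    boolean isNullsFirst() { return nullsFirst; }
}

// Hypothetical helper mirroring the checks listed for repartitionHash.
final class RepartitionArgsSketch {
    private RepartitionArgsSketch() {}

    static void checkHashArgs(int numPartitions, String... columns) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be > 0, got " + numPartitions);
        }
        if (columns == null || columns.length == 0) {
            throw new IllegalArgumentException("columns must be non-null and non-empty");
        }
        for (String c : columns) {
            if (c == null) {
                throw new IllegalArgumentException("columns must not contain null elements");
            }
        }
    }
}
```

With this shape, `SortExprSketch.asc("a").nullsFirst(true)` keeps the ascending direction and only flips the NULLs ordering, which is the behaviour the fluent setter is meant to provide.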
   
   Why typed `SortExpr` instead of the SQL-string flavour the issue suggests as 
option 1: `DataFrame::parse_sql_expr` parses a single expression, not an `ORDER 
BY` list, and DataFusion 53.1 has no `parse_sort_exprs` helper. The string 
flavour would force hand-rolled SQL parsing on the native side. The issue 
authorises starting at option 2; the SQL-string flavour can be layered on later 
if/when an `Expr` builder lands.
   
   Out of scope (for follow-ups):
   
   - SQL-string sort flavour (`df.sort("a ASC, b DESC NULLS FIRST")`).
   - Sort-key complex expressions (`SortExpr.asc("a + b")`). The field is named 
`column` (not `expr`) to make this contract enforceable.
   - `Partitioning::DistributeBy` and `Partitioning::Hash` with arbitrary 
expressions.
   - Partition-count assertions in tests -- the binding does not yet expose 
`collect_partitioned`, so the tests assert only the row-preservation invariant.
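
One way to read the row-preservation invariant is that repartitioning may reorder rows but must neither drop nor duplicate them. The sketch below checks that property on plain lists. The class name `RowPreservationCheck` and the modelling of rows as strings are assumptions for illustration; the PR's tests may compare results differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: compares two row sets as multisets, ignoring order.
final class RowPreservationCheck {
    private RowPreservationCheck() {}

    // Returns true when both lists hold the same rows with the same multiplicities.
    static boolean sameRows(List<String> before, List<String> after) {
        List<String> a = new ArrayList<>(before);
        List<String> b = new ArrayList<>(after);
        Collections.sort(a);
        Collections.sort(b);
        return a.equals(b);
    }
}
```

Sorting copies of both lists keeps the inputs untouched and makes duplicates count, which a set-based comparison would miss.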
   
   ## Are these changes tested?
   
   Yes -- 20 new tests across `SortExprTest` and 
`DataFrameTransformationsTest`, plus six new lines extending the existing 
close/collect coverage.
   
   ## Are there any user-facing changes?
   
   Yes -- purely additive. New public API:
   
   - `org.apache.datafusion.SortExpr` (value class)
   - `DataFrame.sort(SortExpr...) → DataFrame`
   - `DataFrame.repartitionRoundRobin(int) → DataFrame`
   - `DataFrame.repartitionHash(int, String...) → DataFrame`
   
   No API removals, no deprecations, no behaviour change for existing callers. 
No Cargo feature changes; binary size is unchanged.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

