LantaoJin opened a new pull request, #66:
URL: https://github.com/apache/datafusion-java/pull/66
Add DataFrame.sort(SortExpr...), DataFrame.repartitionRoundRobin(int), and
DataFrame.repartitionHash(int, String...). SortExpr is a small value class with
static asc/desc factories and a fluent nullsFirst setter, mirroring
DataFusion's expr::Sort.
The SQL-string sort flavour the issue lists as option 1 is deferred:
DataFusion 53.1 has no parse_sort_exprs helper on DataFrame, so the string
flavour would force hand-rolled ORDER BY parsing. The typed SortExpr API is the
same shape the issue authorises in option 2.
repartitionHash takes column-name keys for v1 and translates each through
col(...) in the native handler. Expression keys are deferred until a Java-side
Expr builder lands.
## Which issue does this PR close?
- Closes #42.
## Rationale for this change
Two ordering / layout primitives have been missing from the Java `DataFrame`
API: `sort` (no way to order without dropping to SQL) and `repartition` (no way
to control parallelism / partitioning). Both are first-class on the upstream
Rust `DataFrame`, in the default feature set, with no Cargo flag impact. This
PR exposes them additively.
## What changes are included in this PR?
- `SortExpr` -- new value class. Final class with static factories
`SortExpr.asc(String)` / `SortExpr.desc(String)` and a fluent
`nullsFirst(boolean)` setter. Mirrors DataFusion's `expr::Sort { expr, asc,
nulls_first }`. Defaults match upstream: ASC → NULLs last, DESC → NULLs first.
- `DataFrame.sort(SortExpr...)` -- ordering. Empty array is a no-op (matches
`DataFrame::sort(vec![])`); each `SortExpr` is null-checked Java-side; the
receiver remains usable.
- `DataFrame.repartitionRoundRobin(int)` -- maps to
`Partitioning::RoundRobinBatch(usize)`. Java validates `numPartitions > 0`.
- `DataFrame.repartitionHash(int, String...)` -- maps to
`Partitioning::Hash(Vec<Expr>, usize)`. Column-name keys for v1; the native
handler translates each name through `datafusion::logical_expr::col(...)`. Java
validates `numPartitions > 0`, columns non-null/non-empty, no null elements.
- `native/src/lib.rs` -- three JNI handlers (`sortRows`,
`repartitionRoundRobinRows`, `repartitionHashRows`) using the existing
`try_unwrap_or_throw` plumbing. Boolean arrays are decoded via `JBooleanArray`
+ `get_boolean_array_region` (jni 0.21).
- Imports added: `datafusion::logical_expr::{col, Partitioning, SortExpr}`,
`jni::objects::JBooleanArray`.
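For reviewers who want a concrete picture of the `SortExpr` contract described in the list above, here is a minimal sketch of such a value class. It is not the code in this PR: the name `SortExprSketch`, the accessor names, and the choice to have `nullsFirst(boolean)` return a new instance instead of mutating are all assumptions made for illustration. Only the factories, the `column` field name, and the default null ordering come from the description.

```java
import java.util.Objects;

// Hypothetical stand-in for the SortExpr value class described in this PR.
// Accessor names and immutability are assumptions, not the actual source.
final class SortExprSketch {
    private final String column;      // a plain column name, not an expression
    private final boolean ascending;
    private final boolean nullsFirst;

    private SortExprSketch(String column, boolean ascending, boolean nullsFirst) {
        this.column = Objects.requireNonNull(column, "column");
        this.ascending = ascending;
        this.nullsFirst = nullsFirst;
    }

    // Ascending sort; per the PR description, ASC defaults to NULLs last.
    static SortExprSketch asc(String column) {
        return new SortExprSketch(column, true, false);
    }

    // Descending sort; per the PR description, DESC defaults to NULLs first.
    static SortExprSketch desc(String column) {
        return new SortExprSketch(column, false, true);
    }

    // Fluent override of the null ordering. Assumption: returns a copy so
    // that instances stay immutable; the real class may mutate in place.
    SortExprSketch nullsFirst(boolean nullsFirst) {
        return new SortExprSketch(column, ascending, nullsFirst);
    }

    String column() { return column; }
    boolean isAscending() { return ascending; }
    boolean isNullsFirst() { return nullsFirst; }
}
```

With this shape, `SortExprSketch.asc("a").nullsFirst(true)` keeps the ascending direction and only flips the null ordering, which is the behaviour the fluent setter is described as providing.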
Why typed `SortExpr` instead of the SQL-string flavour the issue suggests as
option 1: `DataFrame::parse_sql_expr` parses a single expression, not an `ORDER
BY` list, and DataFusion 53.1 has no `parse_sort_exprs` helper. The string
flavour would force hand-rolled SQL parsing on the native side. The issue
authorises starting at option 2; the SQL-string flavour can be layered on later
if/when an `Expr` builder lands.
Out of scope (for follow-ups):
- SQL-string sort flavour (`df.sort("a ASC, b DESC NULLS FIRST")`).
- Complex sort-key expressions (`SortExpr.asc("a + b")`). The field is named
`column` (not `expr`) to make this contract explicit.
- `Partitioning::DistributeBy` and `Partitioning::Hash` with arbitrary
expressions.
- Partition-count assertions in tests -- the binding does not yet expose
`collect_partitioned`. Tests assert the row-preservation invariant only.
## Are these changes tested?
Yes -- 20 new tests across `SortExprTest` and
`DataFrameTransformationsTest`, plus six new lines extending the existing
close/collect coverage.
## Are there any user-facing changes?
Yes -- purely additive. New public API:
- `org.apache.datafusion.SortExpr` (value class)
- `DataFrame.sort(SortExpr...) → DataFrame`
- `DataFrame.repartitionRoundRobin(int) → DataFrame`
- `DataFrame.repartitionHash(int, String...) → DataFrame`
No API removals, no deprecations, no behaviour change for existing callers.
No Cargo feature changes; binary size is unchanged.
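The Java-side argument checks described for `repartitionRoundRobin` and `repartitionHash` (`numPartitions > 0`, columns non-null and non-empty, no null elements) could look roughly like the sketch below. The helper class `RepartitionArgs`, its method names, and the exception types are hypothetical; the PR description does not say which exceptions the real methods throw.

```java
import java.util.Objects;

// Hypothetical helper illustrating the validation rules listed in this PR.
// Class name, method names, and exception types are assumptions.
final class RepartitionArgs {
    private RepartitionArgs() {}

    // Both repartition flavours require a strictly positive partition count.
    static void checkNumPartitions(int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException(
                "numPartitions must be > 0, got " + numPartitions);
        }
    }

    // Hash partitioning additionally needs at least one non-null column name.
    static void checkHashColumns(String... columns) {
        Objects.requireNonNull(columns, "columns");
        if (columns.length == 0) {
            throw new IllegalArgumentException("columns must not be empty");
        }
        for (int i = 0; i < columns.length; i++) {
            if (columns[i] == null) {
                throw new NullPointerException("columns[" + i + "] is null");
            }
        }
    }
}
```

Validating before crossing the JNI boundary keeps bad input from reaching the native handlers, so callers get an ordinary Java exception at the call site.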