GGraziadei opened a new pull request, #16827:
URL: https://github.com/apache/iceberg/pull/16827
This PR lets you cluster table data along a Hilbert space-filling curve when
compacting files, a new option alongside the existing Z-order strategy.
Just run:
```sql
CALL system.rewrite_data_files(
  table => 'db.tbl',
  strategy => 'sort',
  sort_order => 'hilbert(c1, c2)'
);
```
Why Hilbert? When you map multi-column data onto a single curve, the goal
is for rows that are close in your columns to land close together on disk, so
queries with range filters can skip more files. Hilbert curves do this better
than Z-order because they never make the big "jumps" across space that Z-order
does: consecutive points on the curve are always adjacent in the data space.
In practice, that means tighter clustering and better data skipping for queries
that filter on several columns at once.
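
To make the "jumps" concrete, here is a minimal sketch that walks every cell of a small 2-D grid in curve order and reports the largest Manhattan distance between consecutive cells. It is illustrative only: the function names are hypothetical, the Hilbert mapping is the classic iterative 2-D formulation rather than the Skilling algorithm this PR uses, and none of it is taken from the Iceberg code.

```python
def z_index(x, y, bits):
    """Z-order (Morton) index: interleave the bits of x and y."""
    z = 0
    for b in range(bits):
        z |= ((x >> b) & 1) << (2 * b + 1)  # x occupies the odd bit positions
        z |= ((y >> b) & 1) << (2 * b)      # y occupies the even bit positions
    return z


def hilbert_index(x, y, bits):
    """Hilbert index of (x, y) on a 2**bits by 2**bits grid (classic 2-D form)."""
    n = 1 << bits
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def max_step(index_fn, bits):
    """Largest Manhattan distance between consecutive cells in curve order."""
    side = 1 << bits
    cells = sorted(
        ((x, y) for x in range(side) for y in range(side)),
        key=lambda c: index_fn(c[0], c[1], bits),
    )
    return max(
        abs(ax - bx) + abs(ay - by)
        for (ax, ay), (bx, by) in zip(cells, cells[1:])
    )


if __name__ == "__main__":
    print("z-order max step:", max_step(z_index, 3))
    print("hilbert max step:", max_step(hilbert_index, 3))
```

On the Hilbert curve every step moves to an adjacent cell, while Z-order periodically leaps across the grid; those leaps are what widen the min/max ranges of the files that straddle them.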
Under the hood, the change is small and low-risk: it reuses Iceberg's
existing, well-tested Z-order byte encodings and only adds the curve math on
top, following the standard algorithm from J. Skilling, *"Programming the
Hilbert curve"* (AIP Conf. Proc. 707, 381, 2004;
<https://doi.org/10.1063/1.1751381>). The Z-order code is left untouched. Scope
is limited to Spark 4.1 for now (other engines fall back gracefully), and the
change ships with full test coverage including curve-correctness checks
(bijection + neighbour-locality), runner/action/SQL tests, and a JMH benchmark.
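
For reviewers who want a feel for the two curve-correctness properties, the sketch below transcribes the forward transform from Skilling's paper into Python and checks bijection and neighbour-locality on small grids. It is an assumption-laden illustration: `hilbert_key` and `check_curve` are hypothetical names, the input is plain unsigned integers rather than Iceberg's byte encodings, and the PR's actual Java tests may be structured differently.

```python
from itertools import product


def hilbert_key(coords, bits):
    """Hilbert index of n coordinates (each < 2**bits), after Skilling (2004)."""
    x = list(coords)
    n = len(x)
    m = 1 << (bits - 1)

    # Undo the excess work, from the top bit down.
    q = m
    while q > 1:
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p                  # invert the low bits of x[0]
            else:
                t = (x[0] ^ x[i]) & p      # exchange low bits of x[0] and x[i]
                x[0] ^= t
                x[i] ^= t
        q >>= 1

    # Gray encode.
    for i in range(1, n):
        x[i] ^= x[i - 1]
    t = 0
    q = m
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    for i in range(n):
        x[i] ^= t

    # Interleave the "transposed" form into one integer, high bits first.
    key = 0
    for b in range(bits - 1, -1, -1):
        for i in range(n):
            key = (key << 1) | ((x[i] >> b) & 1)
    return key


def check_curve(dims, bits):
    """Return (bijective, local) for the full grid of the given shape."""
    cells = list(product(range(1 << bits), repeat=dims))
    keyed = sorted((hilbert_key(c, bits), c) for c in cells)
    # Bijection: every index 0..N-1 is produced exactly once.
    bijective = [k for k, _ in keyed] == list(range(len(cells)))
    # Locality: consecutive indexes differ by one step along one axis.
    local = all(
        sum(abs(a - b) for a, b in zip(p, q)) == 1
        for (_, p), (_, q) in zip(keyed, keyed[1:])
    )
    return bijective, local


if __name__ == "__main__":
    print(check_curve(dims=2, bits=3))
    print(check_curve(dims=3, bits=2))
```

Exhaustive checks like these only scale to small grids, but they catch most bit-ordering and rotation mistakes quickly.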
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]