GGraziadei opened a new pull request, #16827:
URL: https://github.com/apache/iceberg/pull/16827

   This PR adds an option to cluster table data along a Hilbert space-filling 
curve when compacting files, alongside the existing Z-order strategy.
   
   Just run:
   ```sql
   CALL system.rewrite_data_files(
     table => 'db.tbl',
     strategy => 'sort',
     sort_order => 'hilbert(c1, c2)'
   );
   ```
   
   Why Hilbert? When you sort multi-column data onto a single curve, the goal 
is for rows that are close in your columns to land close together on disk, so 
queries with range filters can skip more files. Hilbert curves do this better 
than Z-order because they never make the big "jumps" across space that Z-order 
does: consecutive points on the curve are always adjacent cells in the data 
space. In practice, that means tighter clustering and better data skipping for 
queries that filter on several columns at once.
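
   To make the "jumps" concrete, here is a minimal Python sketch. It is not 
code from this PR. It decodes curve positions on a small 2-D grid and measures 
the largest Manhattan distance between consecutive positions. The function 
names are illustrative, and the Hilbert decoder is the common textbook 2-D 
variant, assumed here only for demonstration.

```python
def hilbert_d2xy(order, d):
    """Map position d on a 2-D Hilbert curve of side 2**order to (x, y)."""
    x = y = 0
    t = d
    s = 1
    n = 1 << order
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the sub-square so consecutive quadrants connect.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def zorder_d2xy(order, d):
    """Map position d on a 2-D Z-order curve to (x, y) by de-interleaving bits."""
    x = y = 0
    for i in range(order):
        x |= ((d >> (2 * i)) & 1) << i
        y |= ((d >> (2 * i + 1)) & 1) << i
    return x, y


def max_jump(decode, order):
    """Largest Manhattan distance between consecutive curve positions."""
    cells = [decode(order, d) for d in range(4 ** order)]
    return max(
        abs(ax - bx) + abs(ay - by)
        for (ax, ay), (bx, by) in zip(cells, cells[1:])
    )


if __name__ == "__main__":
    print("hilbert max jump:", max_jump(hilbert_d2xy, 3))
    print("z-order max jump:", max_jump(zorder_d2xy, 3))
```

   On an 8x8 grid the Hilbert maximum is 1, while the Z-order maximum is 
larger because the curve leaps between quadrants.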
   
   Under the hood, the change is small and low-risk: it reuses Iceberg's 
existing, well-tested Z-order byte encodings and only adds the curve math on 
top, following the standard algorithm from J. Skilling, *"Programming the 
Hilbert curve"* (AIP Conf. Proc. 707, 381, 2004; 
<https://doi.org/10.1063/1.1751381>). The Z-order code is left untouched. Scope 
is limited to Spark 4.1 for now (other engines fall back gracefully), and the 
change ships with full test coverage including curve-correctness checks 
(bijection + neighbour-locality), runner/action/SQL tests, and a JMH benchmark.
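
   For readers unfamiliar with the reference, the following Python sketch 
shows the axes-to-transpose step from Skilling's paper together with the two 
curve-correctness properties named above. It is an illustration only: it works 
on small unsigned integers rather than Iceberg's byte encodings, and the 
function names (`axes_to_transpose`, `hilbert_index`, `check_curve`) are 
hypothetical, not identifiers from this PR.

```python
from itertools import product


def axes_to_transpose(point, bits):
    """Skilling's AxestoTranspose: coordinates -> transposed Hilbert index."""
    x = list(point)
    n = len(x)
    m = 1 << (bits - 1)
    # Inverse undo of excess work, from the top bit down.
    q = m
    while q > 1:
        p = q - 1
        for i in range(n):
            if x[i] & q:
                x[0] ^= p  # invert the low bits of x[0]
            else:
                t = (x[0] ^ x[i]) & p  # exchange low bits of x[0] and x[i]
                x[0] ^= t
                x[i] ^= t
        q >>= 1
    # Gray encode.
    for i in range(1, n):
        x[i] ^= x[i - 1]
    t = 0
    q = m
    while q > 1:
        if x[n - 1] & q:
            t ^= q - 1
        q >>= 1
    return [v ^ t for v in x]


def hilbert_index(point, bits):
    """Interleave the transposed words, most significant bit first."""
    h = 0
    for bit in range(bits - 1, -1, -1):
        for v in axes_to_transpose(point, bits):
            h = (h << 1) | ((v >> bit) & 1)
    return h


def check_curve(dims, bits):
    """Return (bijective, local) for the full grid of side 2**bits."""
    side = 1 << bits
    points = list(product(range(side), repeat=dims))
    ordered = sorted(points, key=lambda pt: hilbert_index(pt, bits))
    indexes = [hilbert_index(pt, bits) for pt in ordered]
    # Bijection: every index in [0, side**dims) is hit exactly once.
    bijective = indexes == list(range(side ** dims))
    # Neighbour-locality: consecutive points differ by 1 in one coordinate.
    local = all(
        sum(abs(a - b) for a, b in zip(p, q)) == 1
        for p, q in zip(ordered, ordered[1:])
    )
    return bijective, local


if __name__ == "__main__":
    print(check_curve(dims=3, bits=3))
```

   Both checks enumerate the whole grid, so they are only practical for small 
`dims` and `bits`.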
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

