[PR] [python] Make HASH_FIXED Ray pre-clustering opt-in [paimon]

via GitHub Sat, 30 May 2026 21:46:43 -0700


QuakeWang opened a new pull request, #8046:
URL: https://github.com/apache/paimon/pull/8046


   ### Purpose
   
   Ray writes to HASH_FIXED tables previously pre-clustered every table by 
`(partition_keys..., bucket)` using `groupby().map_groups()`. Ray requires each 
mapped group to fit in memory on a single node, so large buckets or hot 
partitions could cause out-of-memory (OOM) failures during append-only writes.
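
To make the memory concern concrete, here is a minimal pure-Python sketch (it does not call Ray or pypaimon). It groups rows by `(partition_keys..., bucket)` and reports how much of the data lands in the largest group. The helper name `largest_group_share`, the `dt` partition column, and the row layout are hypothetical and chosen only for illustration.

```python
from collections import Counter


def largest_group_share(rows, partition_keys):
    """Return (group_key, share_of_rows) for the largest (partition..., bucket) group.

    Hypothetical helper: it only counts rows per group to show skew.
    """
    counts = Counter(
        tuple(row[k] for k in partition_keys) + (row["bucket"],) for row in rows
    )
    key, size = counts.most_common(1)[0]
    return key, size / len(rows)


# One hot bucket holds 90 rows; two other buckets hold 5 rows each.
rows = [{"dt": "2026-05-30", "bucket": 0} for _ in range(90)]
rows += [{"dt": "2026-05-30", "bucket": b} for b in (1, 2) for _ in range(5)]

key, share = largest_group_share(rows, ["dt"])
print(key, share)
```

In this toy data set a single group holds 90% of the rows. With per-group mapping, that whole group would have to be materialized by one task, which is the situation the description above says could exhaust memory.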
   
   This patch makes HASH_FIXED pre-clustering explicit:
   
     - the default mode, `auto`, writes append-only HASH_FIXED tables directly, 
without pre-clustering
     - `map_groups` keeps the existing small-file optimization as an opt-in mode
     - HASH_FIXED primary-key tables fail fast in the `auto` and `off` modes, 
because direct Ray writes can split one bucket across multiple writer tasks, 
which can then allocate overlapping sequence numbers
     - `write_paimon()` and `TableWrite.write_ray()` use the same safety check
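
The mode rules listed above can be sketched as a small decision function. This is an illustration under stated assumptions, not the code in this patch: the function name `resolve_pre_cluster`, its parameters, and the error messages are hypothetical, and the set of valid modes (`auto`, `off`, `map_groups`) is inferred from the description.

```python
# Assumed set of modes, inferred from the PR description.
VALID_MODES = {"auto", "off", "map_groups"}


def resolve_pre_cluster(mode, has_primary_keys):
    """Return True if a HASH_FIXED write should pre-cluster with map_groups.

    Hypothetical sketch of the shared safety check; raises ValueError for
    combinations that the description says must fail fast.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unknown pre-clustering mode: {mode!r}")
    if mode == "map_groups":
        # Opt-in: each (partition..., bucket) group is handled by one task.
        return True
    if has_primary_keys:
        # Direct writes may split one bucket across several writer tasks,
        # which could allocate overlapping sequence numbers.
        raise ValueError(
            "HASH_FIXED primary-key tables require mode 'map_groups'"
        )
    # Append-only table in 'auto' or 'off': write directly.
    return False


print(resolve_pre_cluster("auto", has_primary_keys=False))
print(resolve_pre_cluster("map_groups", has_primary_keys=True))
```

Keeping the check in one function mirrors the last bullet: both entry points can call the same rule, so a primary-key table cannot bypass it through one API and not the other.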
     
   ### Tests
   
     - `pytest pypaimon/tests/test_ray_shuffle_helper.py 
pypaimon/tests/ray_repartition_test.py`
     - `pytest pypaimon/tests/ray_integration_test.py 
pypaimon/tests/ray_data_test.py`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
