QuakeWang opened a new pull request, #8046:
URL: https://github.com/apache/paimon/pull/8046
### Purpose
Ray HASH_FIXED writes previously pre-clustered every table with
`groupby().map_groups()` by `(partition_keys..., bucket)`. Ray requires each
mapped group to fit in memory on one node, so large buckets or hot partitions
could OOM during append-only writes.
This patch makes HASH_FIXED pre-clustering explicit:
- the default mode, `auto`, writes append-only HASH_FIXED tables directly,
  without pre-clustering
- `map_groups` keeps the existing small-file optimization as an opt-in mode
- HASH_FIXED primary-key tables fail fast in the `auto` and `off` modes, because
  direct Ray writes can split one bucket across multiple writer tasks, which
  can then allocate overlapping sequence numbers
- `write_paimon()` and `TableWrite.write_ray()` use the same safety check
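
The mode rules listed above can be sketched as a single decision function. This is only an illustration of the behavior described in this PR, not the code in the patch: `PreclusterMode` and `should_precluster` are hypothetical names, and the real option key and error type may differ.

```python
from enum import Enum


class PreclusterMode(Enum):
    # Hypothetical enum; the values mirror the modes named in this description.
    AUTO = "auto"
    OFF = "off"
    MAP_GROUPS = "map_groups"


def should_precluster(mode: str, has_primary_keys: bool) -> bool:
    """Decide whether a HASH_FIXED write groups rows by (partition keys, bucket).

    Returns True when rows should be pre-clustered, False for a direct write.
    Raises ValueError for an unknown mode, or for a primary-key table that
    would otherwise be written directly.
    """
    parsed = PreclusterMode(mode)  # Enum lookup raises ValueError if unknown

    if parsed is PreclusterMode.MAP_GROUPS:
        # Opt-in: one group per (partition, bucket), so one writer per bucket.
        return True

    if has_primary_keys:
        # A direct write may split a bucket across writer tasks, which could
        # then hand out overlapping sequence numbers. Fail before writing.
        raise ValueError(
            f"HASH_FIXED primary-key tables cannot be written with mode "
            f"'{mode}'; use 'map_groups' instead"
        )

    # Append-only table: a direct write is safe and avoids the requirement
    # that each mapped group fit in memory on one node.
    return False
```

Under these assumptions, both write entry points would call the same function before dispatching any Ray tasks, so a misconfigured primary-key write fails before data is written.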
### Tests
- `pytest pypaimon/tests/test_ray_shuffle_helper.py pypaimon/tests/ray_repartition_test.py`
- `pytest pypaimon/tests/ray_integration_test.py pypaimon/tests/ray_data_test.py`