Re: [PR] [python] Pre-repartition Ray writes by (partition, bucket) for fixed-bucket tables [paimon]

via GitHub Fri, 15 May 2026 00:15:33 -0700


TheR1sing3un commented on code in PR #7813:
URL: https://github.com/apache/paimon/pull/7813#discussion_r3246519357



##########
paimon-python/pypaimon/ray/ray_paimon.py:
##########
@@ -117,15 +119,39 @@ def write_paimon(
         table_identifier: Full table name, e.g. ``"db_name.table_name"``.
         catalog_options: Options passed to ``CatalogFactory.create()``.
         overwrite: If ``True``, overwrite existing data in the table.
+        shuffle: When ``True`` and the target is a HASH_FIXED table, cluster

Review Comment:
   > Why do we need to add this option? Why not just shuffle for bucketed tables?
   
   done~



##########
paimon-python/pypaimon/ray/ray_paimon.py:
##########
@@ -117,15 +119,39 @@ def write_paimon(
         table_identifier: Full table name, e.g. ``"db_name.table_name"``.
         catalog_options: Options passed to ``CatalogFactory.create()``.
         overwrite: If ``True``, overwrite existing data in the table.
+        shuffle: When ``True`` and the target is a HASH_FIXED table, cluster
+            rows by ``(partition_keys..., bucket)`` so each (partition,
+            bucket) lands in one Ray task — reduces the small-file count
+            for distributed writes. Non-HASH_FIXED tables log a warning
+            and fall back to no-shuffle. Defaults to ``False`` (Ray's
+            default round-robin distribution).
+        override_num_blocks: Optional Ray output block count. Must be

Review Comment:
   > It can be derived directly from the number of buckets, so it is not strictly necessary.
   
   done~
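
For readers following the thread, the clustering described in the docstring above can be sketched in plain Python. This is only an illustration under stated assumptions, not the PR's implementation: `bucket_of`, `cluster_rows`, and `max_blocks` are hypothetical helpers, the CRC32-based hash is a stand-in rather than Paimon's actual bucket function, and no Ray API is used. It shows rows being grouped by `(partition, bucket)` and an upper bound on the block count that follows from the bucket count.

```python
from collections import defaultdict
import zlib


def bucket_of(row, bucket_keys, num_buckets):
    """Hypothetical bucket assignment (NOT Paimon's real hash function)."""
    key = "|".join(str(row[k]) for k in bucket_keys).encode("utf-8")
    return zlib.crc32(key) % num_buckets


def cluster_rows(rows, partition_keys, bucket_keys, num_buckets):
    """Group rows so every (partition, bucket) pair ends up in one group."""
    groups = defaultdict(list)
    for row in rows:
        partition = tuple(row[k] for k in partition_keys)
        bucket = bucket_of(row, bucket_keys, num_buckets)
        groups[(partition, bucket)].append(row)
    return dict(groups)


def max_blocks(num_partitions, num_buckets):
    """Upper bound on groups: one per (partition, bucket) combination."""
    return num_partitions * num_buckets


rows = [
    {"dt": "2026-05-14", "user_id": 1, "v": 10},
    {"dt": "2026-05-14", "user_id": 2, "v": 20},
    {"dt": "2026-05-15", "user_id": 1, "v": 30},
    {"dt": "2026-05-15", "user_id": 3, "v": 40},
    {"dt": "2026-05-14", "user_id": 1, "v": 50},
]

groups = cluster_rows(rows, ["dt"], ["user_id"], num_buckets=4)
for (partition, bucket), members in sorted(groups.items()):
    print(partition, bucket, len(members))
```

If each group were handed to a single writer task, a given (partition, bucket) would be written by one task only, which is the small-file reduction the docstring refers to. Because the group count is bounded by partitions times buckets, a caller-supplied block count is not strictly required.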



