[I] [Improvement] Spark connector support Paimon distribution (CLUSTERED BY ... INTO ... BUCKETS) [gravitino]

via GitHub Wed, 11 Mar 2026 00:37:06 -0700


FANNG1 opened a new issue, #10367:
URL: https://github.com/apache/gravitino/issues/10367


   The Spark Paimon connector currently does not expose distribution support in its connector capability or docs, but the core path already has most of the building blocks.
   
   Background:
   - Spark Paimon IT disables clustered-by tests:
     - /Users/fanng/opensource/gravitino/spark-connector/spark-common/src/test/java/org/apache/gravitino/spark/connector/integration/test/paimon/SparkPaimonCatalogIT.java:47
   - Spark connector already converts bucket transforms to Gravitino distribution and passes it into `createTable`:
     - /Users/fanng/opensource/gravitino/spark-connector/spark-common/src/main/java/org/apache/gravitino/spark/connector/SparkTransformConverter.java:140
     - /Users/fanng/opensource/gravitino/spark-connector/spark-common/src/main/java/org/apache/gravitino/spark/connector/catalog/BaseCatalog.java:209
   - Paimon backend already validates and applies HASH distribution:
     - /Users/fanng/opensource/gravitino/catalogs/catalog-lakehouse-paimon/src/main/java/org/apache/gravitino/catalog/lakehouse/paimon/PaimonCatalogOperations.java:505
   - Current doc says Spark Paimon does not support distribution:
     - /Users/fanng/opensource/gravitino/docs/spark-connector/spark-catalog-paimon.md:25
   
   Suggested SQL for validation:
   
   ```sql
   USE paimon_catalog.default;
   
   CREATE TABLE dist_tbl (
     id INT,
     name STRING,
     address STRING
   ) USING paimon
   CLUSTERED BY (id) INTO 4 BUCKETS;
   
   DESC TABLE EXTENDED dist_tbl;
   SHOW TBLPROPERTIES dist_tbl;
   ```
   
   How should we improve?
   - Enable distribution capability for Spark Paimon and add integration coverage.
   - Ensure `CLUSTERED BY (...) INTO N BUCKETS` is persisted as Paimon bucket metadata.
   - Add/enable ITs to verify table metadata after create/load.
   - Update Spark Paimon docs to mark distribution as supported.
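   
   To make the expected metadata concrete, here is a minimal sketch of how a bucket distribution might map onto Paimon table options. `bucket` and `bucket-key` are Paimon's own table options; the class `PaimonBucketOptionsSketch` and the helper `toBucketOptions` are hypothetical names for illustration, and whether Gravitino's Paimon catalog persists the distribution in exactly this form is an assumption, not something confirmed above.
   
   ```java
   import java.util.LinkedHashMap;
   import java.util.List;
   import java.util.Map;
   
   // Hypothetical sketch, not Gravitino code: shows the Paimon options an
   // integration test could look for after
   // CREATE TABLE ... CLUSTERED BY (id) INTO 4 BUCKETS.
   public class PaimonBucketOptionsSketch {
   
     // Assumption: a fixed-bucket HASH distribution is stored as the Paimon
     // table options "bucket" (bucket count) and "bucket-key" (comma-separated
     // column names).
     static Map<String, String> toBucketOptions(int bucketNum, List<String> columns) {
       if (bucketNum <= 0) {
         // Simplification: only positive, fixed bucket counts are handled here.
         throw new IllegalArgumentException("bucket number must be positive: " + bucketNum);
       }
       if (columns == null || columns.isEmpty()) {
         throw new IllegalArgumentException("at least one bucket column is required");
       }
       Map<String, String> options = new LinkedHashMap<>();
       options.put("bucket", String.valueOf(bucketNum));
       options.put("bucket-key", String.join(",", columns));
       return options;
     }
   
     public static void main(String[] args) {
       // Mirrors the validation SQL: CLUSTERED BY (id) INTO 4 BUCKETS
       System.out.println(toBucketOptions(4, List.of("id")));
     }
   }
   ```
   
   If the mapping holds, `SHOW TBLPROPERTIES dist_tbl` from the validation SQL would be the natural place for an IT to look for these two keys.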
   
   Out of scope:
   - Sort orders (`SORTED BY`).
   - Row-level operations (`DELETE`/`UPDATE`/`MERGE`).
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
