Copilot commented on code in PR #9858:
URL: https://github.com/apache/gravitino/pull/9858#discussion_r2769414056
##########
api/src/main/java/org/apache/gravitino/rel/indexes/Index.java:
##########
@@ -111,6 +111,19 @@ enum IndexType {
/** IVF_HNSW_FLAT */
IVF_HNSW_SQ,
/** IVF_HNSW_PQ */
- IVF_HNSW_PQ;
+ IVF_HNSW_PQ,
+
+ // The following index types are data skipping indexes in ClickHouse
+ /** minmax data skipping index */
+ DATA_SKIPPING_MINMAX,
+
+ /** Bloom filter data skipping index */
+ DATA_SKIPPING_BLOOM_FILTER,
+
+ /** ngrambf_v1 data skipping index */
+ DATA_SKIPPING_NGRAMBF_V1,
+
+ /** tokenbf_v1 data skipping index */
+ DATA_SKIPPING_TOKENBF_V1;
Review Comment:
The new index types DATA_SKIPPING_NGRAMBF_V1 and DATA_SKIPPING_TOKENBF_V1
are added to the Index.IndexType enum but are not implemented in the ClickHouse
catalog. If either of these index types is used, appendIndexesSql will throw an
IllegalArgumentException from its default case (lines 408-409). Either implement
these index types in the ClickHouse catalog's appendIndexesSql method, or remove
them from this PR if they are not ready for use yet.
```suggestion
DATA_SKIPPING_BLOOM_FILTER;
```
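If the first option is taken and the two enum values are implemented rather than removed, here is a minimal sketch of how the corresponding clauses could be rendered, mirroring the existing DATA_SKIPPING_MINMAX/DATA_SKIPPING_BLOOM_FILTER branches. The helper class name and the ngrambf_v1/tokenbf_v1 parameter values below are illustrative assumptions, not taken from this PR:
```java
import org.apache.gravitino.rel.indexes.Index;

// Sketch only: the index name and field string are assumed to be quoted by the
// caller, as they are for the existing data skipping branches.
final class DataSkippingIndexSqlSketch {
  private DataSkippingIndexSqlSketch() {}

  static String clauseFor(Index.IndexType type, String quotedIndexName, String fieldStr) {
    switch (type) {
      case DATA_SKIPPING_NGRAMBF_V1:
        // ngrambf_v1(n-gram size, bloom filter size in bytes, hash functions, seed);
        // the parameter values here are illustrative defaults.
        return " INDEX %s %s TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 1"
            .formatted(quotedIndexName, fieldStr);
      case DATA_SKIPPING_TOKENBF_V1:
        // tokenbf_v1(bloom filter size in bytes, hash functions, seed).
        return " INDEX %s %s TYPE tokenbf_v1(256, 2, 0) GRANULARITY 1"
            .formatted(quotedIndexName, fieldStr);
      default:
        throw new IllegalArgumentException("Not a ngrambf_v1/tokenbf_v1 index type: " + type);
    }
  }
}
```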
##########
catalogs-contrib/catalog-jdbc-clickhouse/src/main/java/org/apache/gravitino/catalog/clickhouse/operations/ClickHouseTableOperations.java:
##########
@@ -170,6 +177,34 @@ protected String generateCreateTableSql(
return result;
}
+ /**
+ * Append CREATE TABLE clause. If cluster name && on-cluster is specified in properties, append ON
+ * CLUSTER clause.
+ *
+ * @param properties Table properties
+ * @param sqlBuilder SQL builder
+ * @return true if ON CLUSTER clause is appended, false otherwise
+ */
+ private boolean appendCreateTableClause(
+ Map<String, String> properties, StringBuilder sqlBuilder, String tableName) {
+ String clusterName = properties.get(ClusterConstants.CLUSTER_NAME);
+ String onClusterValue = properties.get(ClusterConstants.ON_CLUSTER);
+
+ boolean onCluster =
+ StringUtils.isNotBlank(clusterName)
+ && StringUtils.isNotBlank(onClusterValue)
+ && Boolean.TRUE.equals(BooleanUtils.toBooleanObject(onClusterValue));
Review Comment:
Inconsistent boolean parsing compared to ClickHouseDatabaseOperations.onCluster(),
which uses Boolean.parseBoolean(). The table operations wrap
BooleanUtils.toBooleanObject() in Boolean.TRUE.equals(), while the database
operations use Boolean.parseBoolean(). These behave differently:
BooleanUtils.toBooleanObject() also treats values such as "yes" and "on" as true
and returns null for unrecognized values, while Boolean.parseBoolean() returns
true only for "true". Consider using the same approach in both places for
consistency.
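One way to keep the two code paths consistent is a shared helper that both ClickHouseTableOperations and ClickHouseDatabaseOperations call. The class below is a hypothetical sketch (not part of this PR) that standardizes on BooleanUtils.toBooleanObject():
```java
import org.apache.commons.lang3.BooleanUtils;
import org.apache.commons.lang3.StringUtils;

// Hypothetical helper; centralizes the ON CLUSTER flag parsing so table and
// database operations cannot drift apart.
final class ClusterFlagUtil {
  private ClusterFlagUtil() {}

  /** True only when a cluster name is set and the on-cluster value parses as true. */
  static boolean isOnCluster(String clusterName, String onClusterValue) {
    return StringUtils.isNotBlank(clusterName)
        && StringUtils.isNotBlank(onClusterValue)
        && Boolean.TRUE.equals(BooleanUtils.toBooleanObject(onClusterValue));
  }
}
```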
##########
catalogs-contrib/catalog-jdbc-clickhouse/src/main/java/org/apache/gravitino/catalog/clickhouse/operations/ClickHouseTableOperations.java:
##########
@@ -228,18 +263,93 @@ private static void appendOrderBy(
}
private ClickHouseTablePropertiesMetadata.ENGINE appendTableEngine(
- Map<String, String> properties, StringBuilder sqlBuilder) {
+ Map<String, String> properties, StringBuilder sqlBuilder, boolean onCluster) {
ClickHouseTablePropertiesMetadata.ENGINE engine =
ENGINE_PROPERTY_ENTRY.getDefaultValue();
if (MapUtils.isNotEmpty(properties)) {
- String userSetEngine = properties.remove(CLICKHOUSE_ENGINE_KEY);
+ String userSetEngine = properties.get(CLICKHOUSE_ENGINE_KEY);
if (StringUtils.isNotEmpty(userSetEngine)) {
engine =
ClickHouseTablePropertiesMetadata.ENGINE.fromString(userSetEngine);
}
}
+
+ if (engine == ENGINE.DISTRIBUTED) {
+ if (!onCluster) {
+ throw new IllegalArgumentException(
+ "ENGINE = DISTRIBUTED requires ON CLUSTER clause to be
specified.");
+ }
+
+ // Check properties
+ String clusterName = properties.get(ClusterConstants.CLUSTER_NAME);
+ String remoteDatabase = properties.get(DistributedTableConstants.REMOTE_DATABASE);
+ String remoteTable = properties.get(DistributedTableConstants.REMOTE_TABLE);
+ String shardingKey = properties.get(DistributedTableConstants.SHARDING_KEY);
+
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(clusterName),
+ "Cluster name must be specified when engine is Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(remoteDatabase),
+ "Remote database must be specified for Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(remoteTable), "Remote table must be specified for Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(shardingKey), "Sharding key must be specified for Distributed");
+
+ sqlBuilder.append(
+ "\n ENGINE = %s(`%s`,`%s`,`%s`,%s)"
+ .formatted(
+ ENGINE.DISTRIBUTED.getValue(),
+ clusterName,
+ remoteDatabase,
+ remoteTable,
+ shardingKey));
+ return engine;
+ }
+
+ // Now check if engine is distributed, we need to check the remote database and table properties
+
Review Comment:
This comment appears to be obsolete after the refactoring. The check for the
Distributed engine has already been completed in the preceding if block (lines
275-307), making this comment redundant and potentially confusing.
```suggestion
```
##########
catalogs-contrib/catalog-jdbc-clickhouse/src/main/java/org/apache/gravitino/catalog/clickhouse/operations/ClickHouseTableOperations.java:
##########
@@ -228,18 +263,93 @@ private static void appendOrderBy(
}
private ClickHouseTablePropertiesMetadata.ENGINE appendTableEngine(
- Map<String, String> properties, StringBuilder sqlBuilder) {
+ Map<String, String> properties, StringBuilder sqlBuilder, boolean onCluster) {
ClickHouseTablePropertiesMetadata.ENGINE engine =
ENGINE_PROPERTY_ENTRY.getDefaultValue();
if (MapUtils.isNotEmpty(properties)) {
- String userSetEngine = properties.remove(CLICKHOUSE_ENGINE_KEY);
+ String userSetEngine = properties.get(CLICKHOUSE_ENGINE_KEY);
if (StringUtils.isNotEmpty(userSetEngine)) {
engine =
ClickHouseTablePropertiesMetadata.ENGINE.fromString(userSetEngine);
}
}
+
+ if (engine == ENGINE.DISTRIBUTED) {
+ if (!onCluster) {
+ throw new IllegalArgumentException(
+ "ENGINE = DISTRIBUTED requires ON CLUSTER clause to be
specified.");
+ }
+
+ // Check properties
+ String clusterName = properties.get(ClusterConstants.CLUSTER_NAME);
+ String remoteDatabase = properties.get(DistributedTableConstants.REMOTE_DATABASE);
+ String remoteTable = properties.get(DistributedTableConstants.REMOTE_TABLE);
+ String shardingKey = properties.get(DistributedTableConstants.SHARDING_KEY);
+
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(clusterName),
+ "Cluster name must be specified when engine is Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(remoteDatabase),
+ "Remote database must be specified for Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(remoteTable), "Remote table must be specified for Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(shardingKey), "Sharding key must be specified for Distributed");
+
+ sqlBuilder.append(
+ "\n ENGINE = %s(`%s`,`%s`,`%s`,%s)"
+ .formatted(
+ ENGINE.DISTRIBUTED.getValue(),
+ clusterName,
+ remoteDatabase,
+ remoteTable,
+ shardingKey));
Review Comment:
The sharding key expression is inserted directly into the SQL without escaping
or validation. While the ClickHouse Distributed engine does accept expressions
here, this could be a security concern if the sharding key value comes from
untrusted input. Consider validating that the sharding key is a simple column
reference or a safe expression, or document that users must ensure the sharding
key is a trusted value.
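A sketch of the kind of validation suggested above, under the simplest policy of only accepting a plain (optionally backtick-quoted) column name. Deployments that rely on expressions such as rand() or cityHash64(col) would need a broader rule, so the pattern below is a starting point, not a final check, and the class is hypothetical:
```java
import java.util.regex.Pattern;

// Hypothetical validator; not part of this PR.
final class ShardingKeySketch {
  // Accepts `quoted` identifiers or bare identifiers like user_id.
  private static final Pattern SIMPLE_COLUMN =
      Pattern.compile("^(`[^`]+`|[A-Za-z_][A-Za-z0-9_]*)$");

  private ShardingKeySketch() {}

  static void validate(String shardingKey) {
    if (shardingKey == null || !SIMPLE_COLUMN.matcher(shardingKey).matches()) {
      throw new IllegalArgumentException(
          "Sharding key must be a simple column reference, got: " + shardingKey);
    }
  }
}
```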
##########
catalogs-contrib/catalog-jdbc-clickhouse/src/main/java/org/apache/gravitino/catalog/clickhouse/operations/ClickHouseTableOperations.java:
##########
@@ -278,6 +388,22 @@ private void appendIndexesSql(Index[] indexes, StringBuilder sqlBuilder) {
// fieldStr already quoted in getIndexFieldStr
sqlBuilder.append(" PRIMARY KEY (").append(fieldStr).append(")");
break;
+ case DATA_SKIPPING_MINMAX:
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(index.name()), "Data skipping index name must not be blank");
+ // The GRANULARITY value is always 1 here currently.
+ sqlBuilder.append(
+ " INDEX %s %s TYPE minmax GRANULARITY 1"
+ .formatted(quoteIdentifier(index.name()), fieldStr));
+ break;
+ case DATA_SKIPPING_BLOOM_FILTER:
+ // The GRANULARITY value is always 3 here currently.
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(index.name()), "Data skipping index name must not be blank");
+ sqlBuilder.append(
+ " INDEX %s %s TYPE bloom_filter GRANULARITY 3"
+ .formatted(quoteIdentifier(index.name()), fieldStr));
Review Comment:
The hardcoded GRANULARITY values (1 for minmax, 3 for bloom_filter) should be
documented or made configurable. Different use cases may require different
granularity values for optimal performance. Consider adding a comment explaining
why these specific values were chosen, or making them configurable through table
properties.
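A sketch of how the granularity could be read from table properties while keeping the current values (1 for minmax, 3 for bloom_filter) as defaults; the property keys and class below are hypothetical and not defined in this PR:
```java
import java.util.Map;

// Hypothetical property keys; not part of this PR.
final class IndexGranularitySketch {
  private static final String MINMAX_GRANULARITY_KEY = "index.minmax.granularity";
  private static final String BLOOM_FILTER_GRANULARITY_KEY = "index.bloom-filter.granularity";

  private IndexGranularitySketch() {}

  static int minmaxGranularity(Map<String, String> properties) {
    return parseOrDefault(properties.get(MINMAX_GRANULARITY_KEY), 1);
  }

  static int bloomFilterGranularity(Map<String, String> properties) {
    return parseOrDefault(properties.get(BLOOM_FILTER_GRANULARITY_KEY), 3);
  }

  private static int parseOrDefault(String value, int defaultValue) {
    return value == null || value.isBlank() ? defaultValue : Integer.parseInt(value.trim());
  }
}
```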