Copilot commented on code in PR #9858:
URL: https://github.com/apache/gravitino/pull/9858#discussion_r2769414056
##########
api/src/main/java/org/apache/gravitino/rel/indexes/Index.java:
##########
@@ -111,6 +111,19 @@ enum IndexType {
/** IVF_HNSW_FLAT */
IVF_HNSW_SQ,
/** IVF_HNSW_PQ */
- IVF_HNSW_PQ;
+ IVF_HNSW_PQ,
+
+ // The following index types are data skipping indexes in ClickHouse
+ /** minmax data skipping index */
+ DATA_SKIPPING_MINMAX,
+
+ /** Bloom filter data skipping index */
+ DATA_SKIPPING_BLOOM_FILTER,
+
+ /** ngrambf_v1 data skipping index */
+ DATA_SKIPPING_NGRAMBF_V1,
+
+ /** tokenbf_v1 data skipping index */
+ DATA_SKIPPING_TOKENBF_V1;
Review Comment:
The new index types DATA_SKIPPING_NGRAMBF_V1 and DATA_SKIPPING_TOKENBF_V1
are added to the Index.IndexType enum but are not implemented in the ClickHouse
catalog. If either of these index types is used, appendIndexesSql will throw an
IllegalArgumentException from its default case (lines 408-409). Either implement
these index types in the ClickHouse catalog's appendIndexesSql method, or remove
them from this PR if they are not ready for use yet.
```suggestion
DATA_SKIPPING_BLOOM_FILTER;
```
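If the first option is taken and the two enum values are implemented rather than removed, here is a minimal sketch of how the corresponding clauses could be rendered, mirroring the existing DATA_SKIPPING_MINMAX/DATA_SKIPPING_BLOOM_FILTER branches. The helper class name and the ngrambf_v1/tokenbf_v1 parameter values below are illustrative assumptions, not taken from this PR:
```java
import org.apache.gravitino.rel.indexes.Index;

// Sketch only: the index name and field string are assumed to be quoted by the
// caller, as they are for the existing data skipping branches.
final class DataSkippingIndexSqlSketch {
  private DataSkippingIndexSqlSketch() {}

  static String clauseFor(Index.IndexType type, String quotedIndexName, String fieldStr) {
    switch (type) {
      case DATA_SKIPPING_NGRAMBF_V1:
        // ngrambf_v1(n-gram size, bloom filter size in bytes, hash functions, seed);
        // the parameter values here are illustrative defaults.
        return " INDEX %s %s TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 1"
            .formatted(quotedIndexName, fieldStr);
      case DATA_SKIPPING_TOKENBF_V1:
        // tokenbf_v1(bloom filter size in bytes, hash functions, seed).
        return " INDEX %s %s TYPE tokenbf_v1(256, 2, 0) GRANULARITY 1"
            .formatted(quotedIndexName, fieldStr);
      default:
        throw new IllegalArgumentException("Not a ngrambf_v1/tokenbf_v1 index type: " + type);
    }
  }
}
```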
##########
catalogs-contrib/catalog-jdbc-clickhouse/src/main/java/org/apache/gravitino/catalog/clickhouse/operations/ClickHouseTableOperations.java:
##########
@@ -170,6 +177,34 @@ protected String generateCreateTableSql(
return result;
}
+ /**
+ * Append CREATE TABLE clause. If cluster name && on-cluster is specified in properties, append ON
+ * CLUSTER clause.
+ *
+ * @param properties Table properties
+ * @param sqlBuilder SQL builder
+ * @return true if ON CLUSTER clause is appended, false otherwise
+ */
+ private boolean appendCreateTableClause(
+ Map<String, String> properties, StringBuilder sqlBuilder, String tableName) {
+ String clusterName = properties.get(ClusterConstants.CLUSTER_NAME);
+ String onClusterValue = properties.get(ClusterConstants.ON_CLUSTER);
+
+ boolean onCluster =
+ StringUtils.isNotBlank(clusterName)
+ && StringUtils.isNotBlank(onClusterValue)
+ && Boolean.TRUE.equals(BooleanUtils.toBooleanObject(onClusterValue));
Review Comment:
Inconsistent boolean parsing compared to ClickHouseDatabaseOperations.onCluster(),
which uses Boolean.parseBoolean(). The table operations wrap
BooleanUtils.toBooleanObject() in Boolean.TRUE.equals(), while the database
operations use Boolean.parseBoolean(). These behave differently:
BooleanUtils.toBooleanObject() also treats values such as "yes" and "on" as true
and returns null for unrecognized values, while Boolean.parseBoolean() returns
true only for "true". Consider using the same approach in both places for
consistency.
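One way to keep the two code paths consistent is a shared helper that both ClickHouseTableOperations and ClickHouseDatabaseOperations call. The class below is a hypothetical sketch (not part of this PR) that standardizes on BooleanUtils.toBooleanObject():
```java
import org.apache.commons.lang3.BooleanUtils;
import org.apache.commons.lang3.StringUtils;

// Hypothetical helper; centralizes the ON CLUSTER flag parsing so table and
// database operations cannot drift apart.
final class ClusterFlagUtil {
  private ClusterFlagUtil() {}

  /** True only when a cluster name is set and the on-cluster value parses as true. */
  static boolean isOnCluster(String clusterName, String onClusterValue) {
    return StringUtils.isNotBlank(clusterName)
        && StringUtils.isNotBlank(onClusterValue)
        && Boolean.TRUE.equals(BooleanUtils.toBooleanObject(onClusterValue));
  }
}
```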
##########
catalogs-contrib/catalog-jdbc-clickhouse/src/main/java/org/apache/gravitino/catalog/clickhouse/operations/ClickHouseTableOperations.java:
##########
@@ -228,18 +263,93 @@ private static void appendOrderBy(
}
private ClickHouseTablePropertiesMetadata.ENGINE appendTableEngine(
- Map<String, String> properties, StringBuilder sqlBuilder) {
+ Map<String, String> properties, StringBuilder sqlBuilder, boolean onCluster) {
ClickHouseTablePropertiesMetadata.ENGINE engine =
ENGINE_PROPERTY_ENTRY.getDefaultValue();
if (MapUtils.isNotEmpty(properties)) {
- String userSetEngine = properties.remove(CLICKHOUSE_ENGINE_KEY);
+ String userSetEngine = properties.get(CLICKHOUSE_ENGINE_KEY);
if (StringUtils.isNotEmpty(userSetEngine)) {
engine =
ClickHouseTablePropertiesMetadata.ENGINE.fromString(userSetEngine);
}
}
+
+ if (engine == ENGINE.DISTRIBUTED) {
+ if (!onCluster) {
+ throw new IllegalArgumentException(
+ "ENGINE = DISTRIBUTED requires ON CLUSTER clause to be
specified.");
+ }
+
+ // Check properties
+ String clusterName = properties.get(ClusterConstants.CLUSTER_NAME);
+ String remoteDatabase = properties.get(DistributedTableConstants.REMOTE_DATABASE);
+ String remoteTable = properties.get(DistributedTableConstants.REMOTE_TABLE);
+ String shardingKey = properties.get(DistributedTableConstants.SHARDING_KEY);
+
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(clusterName),
+ "Cluster name must be specified when engine is Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(remoteDatabase),
+ "Remote database must be specified for Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(remoteTable), "Remote table must be specified for Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(shardingKey), "Sharding key must be specified for Distributed");
+
+ sqlBuilder.append(
+ "\n ENGINE = %s(`%s`,`%s`,`%s`,%s)"
+ .formatted(
+ ENGINE.DISTRIBUTED.getValue(),
+ clusterName,
+ remoteDatabase,
+ remoteTable,
+ shardingKey));
+ return engine;
+ }
+
+ // Now check if engine is distributed, we need to check the remote database and table properties
+
Review Comment:
This comment appears to be obsolete after the refactoring. The check for the
Distributed engine has already been completed in the preceding if block (lines
275-307), making this comment redundant and potentially confusing.
```suggestion
```
##########
catalogs-contrib/catalog-jdbc-clickhouse/src/main/java/org/apache/gravitino/catalog/clickhouse/operations/ClickHouseTableOperations.java:
##########
@@ -228,18 +263,93 @@ private static void appendOrderBy(
}
private ClickHouseTablePropertiesMetadata.ENGINE appendTableEngine(
- Map<String, String> properties, StringBuilder sqlBuilder) {
+ Map<String, String> properties, StringBuilder sqlBuilder, boolean onCluster) {
ClickHouseTablePropertiesMetadata.ENGINE engine =
ENGINE_PROPERTY_ENTRY.getDefaultValue();
if (MapUtils.isNotEmpty(properties)) {
- String userSetEngine = properties.remove(CLICKHOUSE_ENGINE_KEY);
+ String userSetEngine = properties.get(CLICKHOUSE_ENGINE_KEY);
if (StringUtils.isNotEmpty(userSetEngine)) {
engine =
ClickHouseTablePropertiesMetadata.ENGINE.fromString(userSetEngine);
}
}
+
+ if (engine == ENGINE.DISTRIBUTED) {
+ if (!onCluster) {
+ throw new IllegalArgumentException(
+ "ENGINE = DISTRIBUTED requires ON CLUSTER clause to be
specified.");
+ }
+
+ // Check properties
+ String clusterName = properties.get(ClusterConstants.CLUSTER_NAME);
+ String remoteDatabase = properties.get(DistributedTableConstants.REMOTE_DATABASE);
+ String remoteTable = properties.get(DistributedTableConstants.REMOTE_TABLE);
+ String shardingKey = properties.get(DistributedTableConstants.SHARDING_KEY);
+
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(clusterName),
+ "Cluster name must be specified when engine is Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(remoteDatabase),
+ "Remote database must be specified for Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(remoteTable), "Remote table must be specified for Distributed");
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(shardingKey), "Sharding key must be specified for Distributed");
+
+ sqlBuilder.append(
+ "\n ENGINE = %s(`%s`,`%s`,`%s`,%s)"
+ .formatted(
+ ENGINE.DISTRIBUTED.getValue(),
+ clusterName,
+ remoteDatabase,
+ remoteTable,
+ shardingKey));
Review Comment:
The sharding key expression is inserted directly into the SQL without escaping
or validation. While the ClickHouse Distributed engine does accept expressions
here, this could be a security concern if the sharding key value comes from
untrusted input. Consider validating that the sharding key is a simple column
reference or a safe expression, or document that users must ensure the sharding
key is a trusted value.
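A sketch of the kind of validation suggested above, under the simplest policy of only accepting a plain (optionally backtick-quoted) column name. Deployments that rely on expressions such as rand() or cityHash64(col) would need a broader rule, so the pattern below is a starting point, not a final check, and the class is hypothetical:
```java
import java.util.regex.Pattern;

// Hypothetical validator; not part of this PR.
final class ShardingKeySketch {
  // Accepts `quoted` identifiers or bare identifiers like user_id.
  private static final Pattern SIMPLE_COLUMN =
      Pattern.compile("^(`[^`]+`|[A-Za-z_][A-Za-z0-9_]*)$");

  private ShardingKeySketch() {}

  static void validate(String shardingKey) {
    if (shardingKey == null || !SIMPLE_COLUMN.matcher(shardingKey).matches()) {
      throw new IllegalArgumentException(
          "Sharding key must be a simple column reference, got: " + shardingKey);
    }
  }
}
```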
##########
catalogs-contrib/catalog-jdbc-clickhouse/src/main/java/org/apache/gravitino/catalog/clickhouse/operations/ClickHouseTableOperations.java:
##########
@@ -278,6 +388,22 @@ private void appendIndexesSql(Index[] indexes, StringBuilder sqlBuilder) {
// fieldStr already quoted in getIndexFieldStr
sqlBuilder.append(" PRIMARY KEY (").append(fieldStr).append(")");
break;
+ case DATA_SKIPPING_MINMAX:
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(index.name()), "Data skipping index name must not be blank");
+ // The GRANULARITY value is always 1 here currently.
+ sqlBuilder.append(
+ " INDEX %s %s TYPE minmax GRANULARITY 1"
+ .formatted(quoteIdentifier(index.name()), fieldStr));
+ break;
+ case DATA_SKIPPING_BLOOM_FILTER:
+ // The GRANULARITY value is always 3 here currently.
+ Preconditions.checkArgument(
+ StringUtils.isNotBlank(index.name()), "Data skipping index name must not be blank");
+ sqlBuilder.append(
+ " INDEX %s %s TYPE bloom_filter GRANULARITY 3"
+ .formatted(quoteIdentifier(index.name()), fieldStr));
Review Comment:
The hardcoded GRANULARITY values (1 for minmax, 3 for bloom_filter) should be
documented or made configurable. Different use cases may require different
granularity values for optimal performance. Consider adding a comment explaining
why these specific values were chosen, or making them configurable through table
properties.
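A sketch of how the granularity could be read from table properties while keeping the current values (1 for minmax, 3 for bloom_filter) as defaults; the property keys and class below are hypothetical and not defined in this PR:
```java
import java.util.Map;

// Hypothetical property keys; not part of this PR.
final class IndexGranularitySketch {
  private static final String MINMAX_GRANULARITY_KEY = "index.minmax.granularity";
  private static final String BLOOM_FILTER_GRANULARITY_KEY = "index.bloom-filter.granularity";

  private IndexGranularitySketch() {}

  static int minmaxGranularity(Map<String, String> properties) {
    return parseOrDefault(properties.get(MINMAX_GRANULARITY_KEY), 1);
  }

  static int bloomFilterGranularity(Map<String, String> properties) {
    return parseOrDefault(properties.get(BLOOM_FILTER_GRANULARITY_KEY), 3);
  }

  private static int parseOrDefault(String value, int defaultValue) {
    return value == null || value.isBlank() ? defaultValue : Integer.parseInt(value.trim());
  }
}
```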