Re: [PR] [HUDI-8768] Support bloom filter options when creating expr index using bloom filter [hudi]

via GitHub Tue, 11 Mar 2025 10:12:26 -0700


yihua commented on code in PR #12919:
URL: https://github.com/apache/hudi/pull/12919#discussion_r1987854711



##########
hudi-common/src/main/java/org/apache/hudi/index/expression/HoodieExpressionIndex.java:
##########
@@ -47,6 +52,20 @@ public interface HoodieExpressionIndex<S, T> extends Serializable {
   String DAYS_OPTION = "days";
   String FORMAT_OPTION = "format";
   String IDENTITY_TRANSFORM = "identity";
+  // Bloom filter options
+  String BLOOM_FILTER_TYPE = "filtertype";
+  String BLOOM_FILTER_NUM_ENTRIES = "numentries";
+  String FALSE_POSITIVE_RATE = "fpp";
+  String DYNAMIC_BLOOM_MAX_ENTRIES = "maxentries";

Review Comment:
   PostgreSQL uses the same pattern of concatenating words without an underscore 
(`_`) for parameter names, so I think this is OK since we follow the same convention.
   https://www.postgresql.org/docs/current/sql-createindex.html
   ```
   CREATE UNIQUE INDEX title_idx ON films (title) WITH (fillfactor = 70);
   CREATE INDEX gin_idx ON documents_table USING GIN (locations) WITH (fastupdate = off);
   ```
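
   For comparison, here is a minimal sketch of how the option keys from the diff above compose into the `create index` statement used in the test. The key strings are copied from the diff; the class `BloomIndexOptionsSketch`, the method `buildCreateIndexSql`, and the table name `tbl` are hypothetical and exist only for illustration, they are not part of Hudi.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

class BloomIndexOptionsSketch {
  // Option keys copied from the HoodieExpressionIndex diff in this review.
  static final String BLOOM_FILTER_TYPE = "filtertype";
  static final String BLOOM_FILTER_NUM_ENTRIES = "numentries";
  static final String FALSE_POSITIVE_RATE = "fpp";

  // Hypothetical helper: renders each option as key='value' and joins them
  // into the options(...) clause of a create index statement.
  static String buildCreateIndexSql(String indexName, String table, String column,
                                    Map<String, String> options) {
    String rendered = options.entrySet().stream()
        .map(e -> e.getKey() + "='" + e.getValue() + "'")
        .collect(Collectors.joining(", "));
    return "create index " + indexName + " on " + table
        + " using bloom_filters(" + column + ") options(" + rendered + ")";
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, so the clause is deterministic.
    Map<String, String> options = new LinkedHashMap<>();
    options.put("expr", "upper");
    options.put(FALSE_POSITIVE_RATE, "0.01");
    options.put(BLOOM_FILTER_TYPE, "SIMPLE");
    options.put(BLOOM_FILTER_NUM_ENTRIES, "1000");
    System.out.println(buildCreateIndexSql("idx_bloom_tbl", "tbl", "city", options));
  }
}
```

   The underscore-free keys (`fpp`, `filtertype`, `numentries`) read the same way as `fillfactor` and `fastupdate` in the PostgreSQL examples.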



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/feature/index/TestExpressionIndex.scala:
##########
@@ -1776,7 +1777,26 @@ class TestExpressionIndex extends HoodieSparkSqlTestBase {
              |""".stripMargin)
 
         // create index using bloom filters on city column with upper() function
-        spark.sql(s"create index idx_bloom_$tableName on $tableName using bloom_filters(city) options(expr='upper')")
+        spark.sql(s"create index idx_bloom_$tableName on $tableName using bloom_filters(city) options(expr='upper', " +
+          s"${HoodieExpressionIndex.FALSE_POSITIVE_RATE}='0.01', ${HoodieExpressionIndex.BLOOM_FILTER_TYPE}='SIMPLE', ${HoodieExpressionIndex.BLOOM_FILTER_NUM_ENTRIES}='1000')")
+        var metaClient = createMetaClient(spark, basePath)
+        assertTrue(metaClient.getTableConfig.getMetadataPartitions.contains(s"expr_index_idx_bloom_$tableName"))
+        assertTrue(metaClient.getIndexMetadata.isPresent)
+        assertEquals(2, metaClient.getIndexMetadata.get.getIndexDefinitions.size())
+        val indexDefinition: HoodieIndexDefinition = metaClient.getIndexMetadata.get.getIndexDefinitions.get(s"expr_index_idx_bloom_$tableName")
+        // validate index options
+        assertEquals("0.01", indexDefinition.getIndexOptions.get(HoodieExpressionIndex.FALSE_POSITIVE_RATE))
+        assertEquals("SIMPLE", indexDefinition.getIndexOptions.get(HoodieExpressionIndex.BLOOM_FILTER_TYPE))
+        assertEquals("1000", indexDefinition.getIndexOptions.get(HoodieExpressionIndex.BLOOM_FILTER_NUM_ENTRIES))
+
+        // validate index metadata
+        val indexMetadataDf = spark.sql(s"select key, BloomFilterMetadata from hudi_metadata('$tableName') where BloomFilterMetadata is not null")
+        assertEquals(4, indexMetadataDf.count()) // corresponding to 4 riders

Review Comment:
   ```suggestion
           assertEquals(4, indexMetadataDf.count()) // corresponding to 4 files
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
