nsivabalan commented on code in PR #12069:
URL: https://github.com/apache/hudi/pull/12069#discussion_r1792409471


##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1053,7 +1053,7 @@ public void update(HoodieCommitMetadata commitMetadata, HoodieData<HoodieRecord>
    * Update functional index from {@link HoodieCommitMetadata}.
    */
   private void updateFunctionalIndexIfPresent(HoodieCommitMetadata commitMetadata, String instantTime, Map<String, HoodieData<HoodieRecord>> partitionToRecordMap) {
-    if (!dataWriteConfig.getMetadataConfig().isFunctionalIndexEnabled()) {
+    if (!MetadataPartitionType.FUNCTIONAL_INDEX.isMetadataPartitionAvailable(dataMetaClient)) {

Review Comment:
   An unrelated minor fix: in SparkHoodieBackedTableMetadataWriter,
   ```
   @Override
   protected HoodieData<HoodieRecord> getFunctionalIndexRecords(List<Pair<String, FileSlice>> partitionFileSlicePairs,
                                                                HoodieIndexDefinition indexDefinition,
                                                                HoodieTableMetaClient metaClient, int parallelism,
                                                                Schema readerSchema, StorageConfiguration<?> storageConf) {
     HoodieFunctionalIndex<Column, Column> functionalIndex =
         new HoodieSparkFunctionalIndex(indexDefinition.getIndexName(), indexDefinition.getIndexFunction(),
             indexDefinition.getSourceFields(), indexDefinition.getIndexOptions());
     HoodieSparkEngineContext sparkEngineContext = (HoodieSparkEngineContext) engineContext;
     if (indexDefinition.getSourceFields().isEmpty()) {
       // In case there are no columns to index, bail
       return sparkEngineContext.emptyHoodieData();
     }
     ...
   ```
   
   We could return right away at the beginning. Why initialize HoodieSparkFunctionalIndex before the empty-source-fields check?
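   To illustrate the suggested reordering, here is a minimal self-contained sketch (the class and method names below are hypothetical stand-ins, not the actual Hudi types): moving the guard clause to the top means the index object is never constructed when there is nothing to index.
   ```java
   import java.util.Collections;
   import java.util.List;

   public class EarlyReturnSketch {
     // Counts constructions so we can observe that the guard skips them.
     public static int constructions = 0;

     // Hypothetical stand-in for an expensive-to-build index object.
     static class FunctionalIndex {
       FunctionalIndex(String name) {
         constructions++;
       }
     }

     // Suggested ordering: bail out before any allocation happens.
     public static List<String> getIndexRecords(List<String> sourceFields) {
       if (sourceFields.isEmpty()) {
         // No columns to index: return empty data without building the index.
         return Collections.emptyList();
       }
       FunctionalIndex index = new FunctionalIndex("idx");
       return sourceFields; // placeholder for the real record generation
     }

     public static void main(String[] args) {
       getIndexRecords(Collections.emptyList());
       System.out.println("constructions=" + constructions); // prints constructions=0
       getIndexRecords(List.of("ts"));
       System.out.println("constructions=" + constructions); // prints constructions=1
     }
   }
   ```
   The behavior is unchanged for the non-empty case; the empty case just avoids the wasted allocation.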



##########
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/command/index/TestFunctionalIndex.scala:
##########
@@ -424,6 +424,66 @@ class TestFunctionalIndex extends HoodieSparkSqlTestBase {
     }
   }
 
+  test("Test Functional Index File-level Stats Update") {
+    if (HoodieSparkUtils.gteqSpark3_3) {
+      withTempDir { tmp =>
+        // create a simple partitioned mor table and insert some records
+        val tableName = generateTableName
+        val basePath = s"${tmp.getCanonicalPath}/$tableName"
+        spark.sql(
+          s"""
+             |create table $tableName (
+             |  id int,
+             |  price double,
+             |  ts long,
+             |  name string
+             |) using hudi
+             | options (
+             |  primaryKey ='id',
+             |  type = 'mor',
+             |  preCombineField = 'ts'
+             | )
+             | partitioned by(name)
+             | location '$basePath'
+       """.stripMargin)
+        // a record with from_unixtime(ts, 'yyyy-MM-dd') = 2020-09-26
+        spark.sql(s"insert into $tableName values(1, 10, 1601098924, 'a1')")
+        // a record with from_unixtime(ts, 'yyyy-MM-dd') = 2021-09-26
+        spark.sql(s"insert into $tableName values(2, 10, 1632634924, 'a1')")
+        // a record with from_unixtime(ts, 'yyyy-MM-dd') = 2022-09-26
+        spark.sql(s"insert into $tableName values(3, 10, 1664170924, 'a2')")
+        // create functional index and verify
+        spark.sql(s"create index idx_datestr on $tableName using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')")
+        val metaClient = createMetaClient(spark, basePath)
+        assertTrue(metaClient.getTableConfig.getMetadataPartitions.contains("func_index_idx_datestr"))
+        assertTrue(metaClient.getIndexMetadata.isPresent)
+        assertEquals(1, metaClient.getIndexMetadata.get.getIndexDefinitions.size())
+
+        // verify functional index records by querying metadata table
+        val metadataSql = s"select ColumnStatsMetadata.minValue.member6.value, ColumnStatsMetadata.maxValue.member6.value from hudi_metadata('$tableName') where type=3"
+        checkAnswer(metadataSql)(
+          Seq("2020-09-26", "2021-09-26"), // for file in name=a1
+          Seq("2022-09-26", "2022-09-26") // for file in name=a2
+        )
+
+        // do another insert after initializing the index
+        // a record with from_unixtime(ts, 'yyyy-MM-dd') = 2024-09-26
+        spark.sql(s"insert into $tableName values(5, 10, 1727329324, 'a3')")
+        // check query result for predicates including values when functional index was disabled
+        checkAnswer(s"select id, name from $tableName where from_unixtime(ts, 'yyyy-MM-dd') IN ('2024-09-26', '2022-09-26')")(
+          Seq(3, "a2"),
+          Seq(5, "a3")
+        )
+        // verify there are new updates to functional index
+        checkAnswer(metadataSql)(
+          Seq("2020-09-26", "2021-09-26"),
+          Seq("2022-09-26", "2022-09-26"),
+          Seq("2024-09-26", "2024-09-26") // for file in name=a3
+        )
+      }
+    }
+  }
+

Review Comment:
   Is this the only class that has functional index tests? This looks like very shallow test coverage for a new index :(
   
   Can we file a follow-up ticket and come up with a test plan for the functional index? I would like to review and get it in.
   
   I can help with the test plan too if needed.
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
