hudi-bot opened a new issue, #17365:
URL: https://github.com/apache/hudi/issues/17365
Based on our analysis, drop-partition support is broken in 0.15.0 for multiple partition fields:
- For a nested field, the partition value is swapped in for a field that has the same name but a different path.
- For the timestamp issue, the field gets replaced with the partition value instead of the value in the file (for example: `"timestamp_micros_nullable_field":"2025-01-25T00:00:00.000Z"`).
- There is also a regression on drop partition where the dropped partition is still being read.
- The replace commit is not written correctly in 0.15.0: `partitionToReplaceFileIds` contains a map with an empty list instead of the file group ids for the partition.

We need a fix for 0.15.0. 1.0 works fine. See the test script:
https://gist.github.com/codope/dfb87d35112ddbb7f207ad3f52320071
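For context, the failing path is a DELETE_PARTITION write through the Spark datasource. A minimal sketch follows (`spark`, `tableName`, `tablePath`, and the partition value are illustrative; the full repro is in the gist above):

```scala
// Sketch: drop one partition of an existing Hudi table via the datasource API.
// An empty DataFrame is enough, since only the partition list is consumed.
import org.apache.spark.sql.SaveMode

spark.emptyDataFrame.write.format("hudi")
  .option("hoodie.datasource.write.operation", "delete_partition")
  // comma-separated list of partition paths to drop
  .option("hoodie.datasource.write.partitions.to.delete", "202401/cat1")
  .option("hoodie.table.name", tableName)
  .mode(SaveMode.Append)
  .save(tablePath)

// A correct replacecommit should map "202401/cat1" to its file group ids in
// partitionToReplaceFileIds; the 0.15.0 bug leaves that list empty.
```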
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-8928
- Type: Sub-task
- Parent: https://issues.apache.org/jira/browse/HUDI-9113
- Fix version(s): 0.15.1
- Attachment(s):
  - 28/Jan/25 10:44, codope: [Screenshot 2025-01-28 at 4.12.49 PM.png](https://issues.apache.org/jira/secure/attachment/13074315/Screenshot+2025-01-28+at+4.12.49%E2%80%AFPM.png)
---
## Comments
28/Jan/25 10:45, codope: With 1.0, ran a simple DELETE_PARTITION test and I can see that `partitionToReplaceFileIds` is populated correctly.
(Attached screenshot: Screenshot 2025-01-28 at 4.12.49 PM.png)
SQL also works: `spark.sql(s"alter table $tableName drop partition (dt='2021-10-01')")`
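To check `partitionToReplaceFileIds` without a debugger, the replacecommit metadata under `.hoodie` can be printed directly. A sketch, assuming a local `tablePath`; note the timeline layout differs between 0.15.x (`.hoodie`) and 1.0 (`.hoodie/timeline`):

```scala
// Sketch: print the latest replacecommit JSON so partitionToReplaceFileIds
// can be inspected by eye. `tablePath` is illustrative.
import java.io.File
import scala.io.Source

val timelineDir = new File(s"$tablePath/.hoodie")
val latestReplace = timelineDir.listFiles()
  .filter(_.getName.endsWith(".replacecommit"))
  .maxBy(_.getName) // instant times sort lexicographically
val src = Source.fromFile(latestReplace)
try println(src.mkString) finally src.close()
```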
---
28/Jan/25 12:18, codope: Ran [TestSparkSqlWithCustomKeyGenerator](https://github.com/apache/hudi/blob/02472c91aac1892d76602795c3f816b58e9c90f7/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlWithCustomKeyGenerator.scala#L255) and could not repro.
{code:scala}
val df = spark.sql(
  s"""SELECT 1 as id, 'a1' as name, 1.6 as price, 1704121827 as ts, 'cat1' as segment
     | UNION
     | SELECT 2 as id, 'a2' as name, 10.8 as price, 1704121827 as ts, 'cat1' as segment
     | UNION
     | SELECT 3 as id, 'a3' as name, 30.0 as price, 1706800227 as ts, 'cat1' as segment
     | UNION
     | SELECT 4 as id, 'a4' as name, 103.4 as price, 1701443427 as ts, 'cat2' as segment
     | UNION
     | SELECT 5 as id, 'a5' as name, 1999.0 as price, 1704121827 as ts, 'cat2' as segment
     | UNION
     | SELECT 6 as id, 'a6' as name, 80.0 as price, 1704121827 as ts, 'cat3' as segment
     |""".stripMargin)
df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", tableType)
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator")
  .option("hoodie.datasource.write.partitionpath.field", "ts:timestamp,segment:simple")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "name")
  .option("hoodie.table.name", tableName)
  .option("hoodie.insert.shuffle.parallelism", "1")
  .option("hoodie.upsert.shuffle.parallelism", "1")
  .option("hoodie.bulkinsert.shuffle.parallelism", "1")
  .option("hoodie.keygen.timebased.timestamp.type", "SCALAR")
  .option("hoodie.keygen.timebased.output.dateformat", "yyyyMM")
  .option("hoodie.keygen.timebased.timestamp.scalar.time.unit", "seconds")
  .mode(SaveMode.Overwrite)
  .save(tablePath)
spark.read.format("hudi").load(tablePath).show(false)

+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+----+------+----------+-------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |id |name|price |ts        |segment|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+----+------+----------+-------+
|20250128120421899  |20250128120421899_0_0|2                 |202401/cat1           |390ee34b-6466-46fb-99da-9c7010d87413-0_0-181-278_20250128120421899.parquet|2  |a2  |10.8  |1704121827|cat1   |
|20250128120421899  |20250128120421899_0_1|1                 |202401/cat1           |390ee34b-6466-46fb-99da-9c7010d87413-0_0-181-278_20250128120421899.parquet|1  |a1  |1.6   |1704121827|cat1   |
|20250128120421899  |20250128120421899_2_0|6                 |202401/cat3           |0b744252-3504-47f6-83b5-36a8ad1d9bbd-0_2-181-280_20250128120421899.parquet|6  |a6  |80.0  |1704121827|cat3   |
|20250128120421899  |20250128120421899_4_0|4                 |202312/cat2           |99e01331-1443-4058-bca7-c35cd56b3c77-0_4-181-282_20250128120421899.parquet|4  |a4  |103.4 |1701443427|cat2   |
|20250128120421899  |20250128120421899_3_0|3                 |202402/cat1           |3b33ac51-c756-4b04-af35-cb358f9ba80a-0_3-181-281_20250128120421899.parquet|3  |a3  |30.0  |1706800227|cat1   |
|20250128120421899  |20250128120421899_1_0|5                 |202401/cat2           |3c8ec017-883d-4ba5-b550-437ee3cc2246-0_1-181-279_20250128120421899.parquet|5  |a5  |1999.0|1704121827|cat2   |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+----+------+----------+-------+
{code}
As you can see, `_hoodie_partition_path` is in the specified output format, while the `ts` field is the same as in the input data.
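The partition paths in the output line up with the scalar-timestamp config above; the mapping from the epoch-seconds `ts` values to `yyyyMM` can be checked independently (a sketch, assuming the key generator's default UTC/GMT output timezone):

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// ts is epoch seconds (SCALAR, time unit = seconds); output format is yyyyMM.
// Assumes UTC, matching the key generator's default output timezone.
val fmt = DateTimeFormatter.ofPattern("yyyyMM").withZone(ZoneOffset.UTC)
println(fmt.format(Instant.ofEpochSecond(1704121827L))) // 202401
println(fmt.format(Instant.ofEpochSecond(1706800227L))) // 202402
println(fmt.format(Instant.ofEpochSecond(1701443427L))) // 202312
```

These match the `202401`, `202402`, and `202312` prefixes of `_hoodie_partition_path` in the output above.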
---
28/Jan/25 12:36, codope: The timestamp partitioning issue is reproducible with 0.15.0. Here's a simple script to repro the issue:
https://gist.github.com/codope/dfb87d35112ddbb7f207ad3f52320071
---
28/Jan/25 15:22, codope: The replacecommit looks fine in both 0.15.0 and the master branch. All tests are updated in this gist:
https://gist.github.com/codope/dfb87d35112ddbb7f207ad3f52320071
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]