hudi-bot opened a new issue, #17365:
URL: https://github.com/apache/hudi/issues/17365
Based on our analysis, drop-partition support is broken in 0.15.0 for multiple partition fields:
- For a nested field, the partition value is swapped in for a field that has the same name but a different path.
- For the timestamp issue, the field gets replaced with the partition value instead of the value in the file (for example: `"timestamp_micros_nullable_field":"2025-01-25T00:00:00.000Z"`).
- There is also a regression on drop partition where the dropped partition is still being read.
- The replace commit is not written correctly in 0.15.0: `partitionToReplaceFileIds` contains a map with an empty list instead of the file group ids for the partition.

We need a fix for 0.15.0. 1.0 works fine. See the test script:
https://gist.github.com/codope/dfb87d35112ddbb7f207ad3f52320071
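For context, the failing path is a DELETE_PARTITION write through the Spark datasource. A minimal sketch follows (`spark`, `tableName`, `tablePath`, and the partition value are illustrative; the full repro is in the gist above):

```scala
// Sketch: drop one partition of an existing Hudi table via the datasource API.
// An empty DataFrame is enough, since only the partition list is consumed.
import org.apache.spark.sql.SaveMode

spark.emptyDataFrame.write.format("hudi")
  .option("hoodie.datasource.write.operation", "delete_partition")
  // comma-separated list of partition paths to drop
  .option("hoodie.datasource.write.partitions.to.delete", "202401/cat1")
  .option("hoodie.table.name", tableName)
  .mode(SaveMode.Append)
  .save(tablePath)

// A correct replacecommit should map "202401/cat1" to its file group ids in
// partitionToReplaceFileIds; the 0.15.0 bug leaves that list empty.
```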
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-8928
- Type: Sub-task
- Parent: https://issues.apache.org/jira/browse/HUDI-9113
- Fix version(s): 0.15.1
- Attachment(s):
  - 28/Jan/25 10:44, codope: [Screenshot 2025-01-28 at 4.12.49 PM.png](https://issues.apache.org/jira/secure/attachment/13074315/Screenshot+2025-01-28+at+4.12.49%E2%80%AFPM.png)
---
## Comments
28/Jan/25 10:45, codope: With 1.0, ran a simple DELETE_PARTITION test and I can see that `partitionToReplaceFileIds` is populated correctly.
(Attached screenshot: Screenshot 2025-01-28 at 4.12.49 PM.png)
SQL also works: `spark.sql(s"alter table $tableName drop partition (dt='2021-10-01')")`
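To check `partitionToReplaceFileIds` without a debugger, the replacecommit metadata under `.hoodie` can be printed directly. A sketch, assuming a local `tablePath`; note the timeline layout differs between 0.15.x (`.hoodie`) and 1.0 (`.hoodie/timeline`):

```scala
// Sketch: print the latest replacecommit JSON so partitionToReplaceFileIds
// can be inspected by eye. `tablePath` is illustrative.
import java.io.File
import scala.io.Source

val timelineDir = new File(s"$tablePath/.hoodie")
val latestReplace = timelineDir.listFiles()
  .filter(_.getName.endsWith(".replacecommit"))
  .maxBy(_.getName) // instant times sort lexicographically
val src = Source.fromFile(latestReplace)
try println(src.mkString) finally src.close()
```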
---
28/Jan/25 12:18, codope: Ran [TestSparkSqlWithCustomKeyGenerator](https://github.com/apache/hudi/blob/02472c91aac1892d76602795c3f816b58e9c90f7/hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/hudi/functional/TestSparkSqlWithCustomKeyGenerator.scala#L255) and could not repro.
{code:scala}
val df = spark.sql(
  s"""SELECT 1 as id, 'a1' as name, 1.6 as price, 1704121827 as ts, 'cat1' as segment
     | UNION
     | SELECT 2 as id, 'a2' as name, 10.8 as price, 1704121827 as ts, 'cat1' as segment
     | UNION
     | SELECT 3 as id, 'a3' as name, 30.0 as price, 1706800227 as ts, 'cat1' as segment
     | UNION
     | SELECT 4 as id, 'a4' as name, 103.4 as price, 1701443427 as ts, 'cat2' as segment
     | UNION
     | SELECT 5 as id, 'a5' as name, 1999.0 as price, 1704121827 as ts, 'cat2' as segment
     | UNION
     | SELECT 6 as id, 'a6' as name, 80.0 as price, 1704121827 as ts, 'cat3' as segment
     |""".stripMargin)
df.write.format("hudi")
  .option("hoodie.datasource.write.table.type", tableType)
  .option("hoodie.datasource.write.keygenerator.class", "org.apache.hudi.keygen.CustomKeyGenerator")
  .option("hoodie.datasource.write.partitionpath.field", "ts:timestamp,segment:simple")
  .option("hoodie.datasource.write.recordkey.field", "id")
  .option("hoodie.datasource.write.precombine.field", "name")
  .option("hoodie.table.name", tableName)
  .option("hoodie.insert.shuffle.parallelism", "1")
  .option("hoodie.upsert.shuffle.parallelism", "1")
  .option("hoodie.bulkinsert.shuffle.parallelism", "1")
  .option("hoodie.keygen.timebased.timestamp.type", "SCALAR")
  .option("hoodie.keygen.timebased.output.dateformat", "yyyyMM")
  .option("hoodie.keygen.timebased.timestamp.scalar.time.unit", "seconds")
  .mode(SaveMode.Overwrite)
  .save(tablePath)
spark.read.format("hudi").load(tablePath).show(false)

+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+----+------+----------+-------+
|_hoodie_commit_time|_hoodie_commit_seqno |_hoodie_record_key|_hoodie_partition_path|_hoodie_file_name                                                         |id |name|price |ts        |segment|
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+----+------+----------+-------+
|20250128120421899  |20250128120421899_0_0|2                 |202401/cat1           |390ee34b-6466-46fb-99da-9c7010d87413-0_0-181-278_20250128120421899.parquet|2  |a2  |10.8  |1704121827|cat1   |
|20250128120421899  |20250128120421899_0_1|1                 |202401/cat1           |390ee34b-6466-46fb-99da-9c7010d87413-0_0-181-278_20250128120421899.parquet|1  |a1  |1.6   |1704121827|cat1   |
|20250128120421899  |20250128120421899_2_0|6                 |202401/cat3           |0b744252-3504-47f6-83b5-36a8ad1d9bbd-0_2-181-280_20250128120421899.parquet|6  |a6  |80.0  |1704121827|cat3   |
|20250128120421899  |20250128120421899_4_0|4                 |202312/cat2           |99e01331-1443-4058-bca7-c35cd56b3c77-0_4-181-282_20250128120421899.parquet|4  |a4  |103.4 |1701443427|cat2   |
|20250128120421899  |20250128120421899_3_0|3                 |202402/cat1           |3b33ac51-c756-4b04-af35-cb358f9ba80a-0_3-181-281_20250128120421899.parquet|3  |a3  |30.0  |1706800227|cat1   |
|20250128120421899  |20250128120421899_1_0|5                 |202401/cat2           |3c8ec017-883d-4ba5-b550-437ee3cc2246-0_1-181-279_20250128120421899.parquet|5  |a5  |1999.0|1704121827|cat2   |
+-------------------+---------------------+------------------+----------------------+--------------------------------------------------------------------------+---+----+------+----------+-------+
{code}
As you can see, `_hoodie_partition_path` is in the specified output format, while the `ts` field is the same as in the input data.
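The partition paths in the output line up with the scalar-timestamp config above; the mapping from the epoch-seconds `ts` values to `yyyyMM` can be checked independently (a sketch, assuming the key generator's default UTC/GMT output timezone):

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// ts is epoch seconds (SCALAR, time unit = seconds); output format is yyyyMM.
// Assumes UTC, matching the key generator's default output timezone.
val fmt = DateTimeFormatter.ofPattern("yyyyMM").withZone(ZoneOffset.UTC)
println(fmt.format(Instant.ofEpochSecond(1704121827L))) // 202401
println(fmt.format(Instant.ofEpochSecond(1706800227L))) // 202402
println(fmt.format(Instant.ofEpochSecond(1701443427L))) // 202312
```

These match the `202401`, `202402`, and `202312` prefixes of `_hoodie_partition_path` in the output above.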
---
28/Jan/25 12:36, codope: The timestamp partitioning issue is reproducible with 0.15.0. Here's a simple script to repro the issue:
https://gist.github.com/codope/dfb87d35112ddbb7f207ad3f52320071
---
28/Jan/25 15:22, codope: The replacecommit looks fine in both 0.15.0 and the master branch. All tests are updated in this gist:
https://gist.github.com/codope/dfb87d35112ddbb7f207ad3f52320071
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]