parisni opened a new issue, #9026:
URL: https://github.com/apache/hudi/issues/9026
Affects Hudi >= 0.11 (including 0.13.1).

I noticed duplicate records in our metadata table:
```
>>> spark.read.format("hudi").load("/tmp/metadata") \
...      .filter("key='version=2/event_date=2009-12-03/event_hour=08'") \
...      .select("key", "filesystemMetadata").show(10, False, True)
-RECORD 0--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 1--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 2--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 3--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 4--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 5--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 6--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 7--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
```
Interestingly, the `files` partition contains 1 hfile and 7 log files. On our other tables, the number of duplicates per partition key equals the number of log files.
```
ls /tmp/metadata/files/
.hoodie_partition_metadata
files-0000_0-16-519_20230620071307473001.hfile
.files-0000_20230620071307473001.log.1_0-23-725
.files-0000_20230620071307473001.log.2_0-16-719
.files-0000_20230620071307473001.log.3_0-16-721
.files-0000_20230620071307473001.log.4_0-16-723
.files-0000_20230620071307473001.log.5_0-16-724
.files-0000_20230620071307473001.log.6_0-16-727
.files-0000_20230620071307473001.log.7_0-16-729
```
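That correlation suggests the reader may be returning one record per log file instead of merging records by key. A minimal plain-Python sketch of the difference (the record lists below are hypothetical stand-ins for the base file and log blocks, not Hudi internals):

```python
KEY = "version=2/event_date=2009-12-03/event_hour=08"
FILE = "3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet"

# One record from the HFile base file, plus an identical record
# from each of the 7 log files.
base = [(KEY, {FILE: (445028, False)})]
logs = [[(KEY, {FILE: (445028, False)})] for _ in range(7)]

# Concatenating base + log records yields one row per source:
concatenated = base + [rec for log in logs for rec in log]
print(len(concatenated))  # 8 rows for a single key, as in the output above

# Merging by key collapses them to one row:
merged = {}
for key, file_map in concatenated:
    merged[key] = {**merged.get(key, {}), **file_map}
print(len(merged))  # 1
```

With 1 base file and 7 logs this produces exactly the 8 identical records shown above for a single key.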
Here is a reproducible script. Once the compaction threshold is reached, the MDT suddenly returns duplicates when read:
```python
tableName = "test_corrupted_mdt"
basePath = "/tmp/{tableName}".format(tableName=tableName)
hudi_options = {
    "hoodie.table.name": tableName,
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "version,event_date",
    "hoodie.datasource.write.table.name": tableName,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "false",
    "hoodie.metadata.enable": "true",
}

mode = "overwrite"
for i in range(1, 11):
    df = spark.sql("select '1' as event_id, '2' as ts, '{}' as version, 'foo' as event_date".format(i))
    df.write.format("hudi").options(**hudi_options).mode(mode).save(basePath)
    mode = "append"

>>> spark.read.format("hudi").load(basePath + "/.hoodie/metadata").count()
21
>>> spark.read.format("hudi").load(basePath + "/.hoodie/metadata").select("key").dropDuplicates().count()
11
```
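For reference, the payload shown above maps each file name to a `{size, isDeleted}` pair, so I would expect merging records for the same key to union the file maps (with deletes removing entries) rather than emit one row per record. A rough sketch of that expected merge semantics (my reading of the payload shape, not Hudi's actual implementation):

```python
def merge_filesystem_metadata(records):
    """Union (size, is_deleted) entries per file name, later records winning;
    files flagged as deleted are dropped from the final listing."""
    merged = {}
    for file_map in records:
        for name, (size, is_deleted) in file_map.items():
            if is_deleted:
                merged.pop(name, None)
            else:
                merged[name] = (size, is_deleted)
    return merged

# The same entry repeated across the base file and 7 log files,
# as in the table output above.
dup = {"3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet": (445028, False)}
print(merge_filesystem_metadata([dup] * 8))  # one entry, not 8 duplicate rows
```

Instead, every query against the MDT returns the un-merged rows.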