parisni opened a new issue, #9026:
URL: https://github.com/apache/hudi/issues/9026
Affects Hudi >= 0.11 (including 0.13.1).

I noticed duplicate records in our metadata table:
```
>>> spark.read.format("hudi").load("/tmp/metadata") \
...      .filter("key='version=2/event_date=2009-12-03/event_hour=08'") \
...      .select("key", "filesystemMetadata").show(10, False, True)
-RECORD 0--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 1--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 2--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 3--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 4--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 5--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 6--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
-RECORD 7--------------------------------------------------------------------------------------------------------------
 key                | version=2/event_date=2009-12-03/event_hour=08
 filesystemMetadata | {3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet -> {445028, false}}
```
Interestingly, the `files` partition contains 1 hfile and 7 log files. On our other tables, the number of duplicates per partition key equals the number of log files.
```
ls /tmp/metadata/files/
.hoodie_partition_metadata
files-0000_0-16-519_20230620071307473001.hfile
.files-0000_20230620071307473001.log.1_0-23-725
.files-0000_20230620071307473001.log.2_0-16-719
.files-0000_20230620071307473001.log.3_0-16-721
.files-0000_20230620071307473001.log.4_0-16-723
.files-0000_20230620071307473001.log.5_0-16-724
.files-0000_20230620071307473001.log.6_0-16-727
.files-0000_20230620071307473001.log.7_0-16-729
```
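That correlation suggests the reader may be returning one record per log file instead of merging records by key. A minimal plain-Python sketch of the difference (the record lists below are hypothetical stand-ins for the base file and log blocks, not Hudi internals):

```python
KEY = "version=2/event_date=2009-12-03/event_hour=08"
FILE = "3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet"

# One record from the HFile base file, plus an identical record
# from each of the 7 log files.
base = [(KEY, {FILE: (445028, False)})]
logs = [[(KEY, {FILE: (445028, False)})] for _ in range(7)]

# Concatenating base + log records yields one row per source:
concatenated = base + [rec for log in logs for rec in log]
print(len(concatenated))  # 8 rows for a single key, as in the output above

# Merging by key collapses them to one row:
merged = {}
for key, file_map in concatenated:
    merged[key] = {**merged.get(key, {}), **file_map}
print(len(merged))  # 1
```

With 1 base file and 7 logs this produces exactly the 8 identical records shown above for a single key.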
Here is a reproducible script. Once the compaction threshold is reached, the MDT suddenly returns duplicates when read:
```python
tableName = "test_corrupted_mdt"
basePath = "/tmp/{tableName}".format(tableName=tableName)
hudi_options = {
    "hoodie.table.name": tableName,
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "version,event_date",
    "hoodie.datasource.write.table.name": tableName,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "false",
    "hoodie.metadata.enable": "true",
}

mode = "overwrite"
for i in range(1, 11):
    df = spark.sql("select '1' as event_id, '2' as ts, '{}' as version, 'foo' as event_date".format(i))
    df.write.format("hudi").options(**hudi_options).mode(mode).save(basePath)
    mode = "append"

>>> spark.read.format("hudi").load(basePath + "/.hoodie/metadata").count()
21
>>> spark.read.format("hudi").load(basePath + "/.hoodie/metadata").select("key").dropDuplicates().count()
11
```
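For reference, the payload shown above maps each file name to a `{size, isDeleted}` pair, so I would expect merging records for the same key to union the file maps (with deletes removing entries) rather than emit one row per record. A rough sketch of that expected merge semantics (my reading of the payload shape, not Hudi's actual implementation):

```python
def merge_filesystem_metadata(records):
    """Union (size, is_deleted) entries per file name, later records winning;
    files flagged as deleted are dropped from the final listing."""
    merged = {}
    for file_map in records:
        for name, (size, is_deleted) in file_map.items():
            if is_deleted:
                merged.pop(name, None)
            else:
                merged[name] = (size, is_deleted)
    return merged

# The same entry repeated across the base file and 7 log files,
# as in the table output above.
dup = {"3f5fc9bd-2d4e-4c0a-8b9b-f29dbcefc579-0_14818-18-113282_20220728200748277.parquet": (445028, False)}
print(merge_filesystem_metadata([dup] * 8))  # one entry, not 8 duplicate rows
```

Instead, every query against the MDT returns the un-merged rows.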