parisni commented on issue #9026:
URL: https://github.com/apache/hudi/issues/9026#issuecomment-1627681838
@yihua
> If the metadata table is queried through Spark datasource directly after
> MDT compaction (i.e., no additional log file in the latest file slice), there
> is no duplicate.
Did you add a new partition during that step? It turns out the duplication
occurs when new partitions are added after compaction. See below: with no new
partitions, there is no duplication; with new partitions, the read returns
many duplicates.
```python
sc.setLogLevel("ERROR")
tableName = 'test_corrupted_mdt'
basePath = "/tmp/{tableName}".format(tableName=tableName)
hudi_options = {
    "hoodie.table.name": tableName,
    "hoodie.datasource.write.recordkey.field": "event_id",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.table.name": tableName,
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.hive_style_partitioning": "true",
    "hoodie.datasource.hive_sync.enable": "false",
    "hoodie.metadata.enable": "true",
}
mode = "overwrite"
for i in range(1, 22):
    # With a new partition on every iteration:
    df = spark.sql("select '1' as event_id, '2' as ts, '" + str(i) + "' as part")
    # Without adding new partitions, use this line instead:
    # df = spark.sql("select '1' as event_id, '2' as ts, '2' as part")
    (df.write.format("hudi").options(**hudi_options).mode(mode).save(basePath))
    mode = "append"
    # Read the metadata table (MDT) directly and count its rows.
    ct = spark.read.format("hudi").load(basePath + "/.hoodie/metadata").count()
    print("NB:" + str(ct) + " for iteration:" + str(i))
```

Output when a new partition is added on every iteration:

```
NB:2 for iteration:1
NB:3 for iteration:2
NB:4 for iteration:3
NB:5 for iteration:4
NB:6 for iteration:5
NB:7 for iteration:6
NB:8 for iteration:7
NB:9 for iteration:8
NB:10 for iteration:9
NB:21 for iteration:10 <--- MDT COMPACTION
NB:32 for iteration:11
NB:43 for iteration:12
NB:54 for iteration:13
NB:65 for iteration:14
NB:76 for iteration:15
NB:87 for iteration:16
NB:98 for iteration:17
NB:109 for iteration:18
NB:120 for iteration:19
NB:41 for iteration:20 <--- MDT COMPACTION
NB:62 for iteration:21
```

Output when no new partition is added:

```
NB:2 for iteration:1
NB:2 for iteration:2
NB:2 for iteration:3
NB:2 for iteration:4
NB:2 for iteration:5
NB:2 for iteration:6
NB:2 for iteration:7
NB:2 for iteration:8
NB:2 for iteration:9
NB:2 for iteration:10 <--- MDT COMPACTION
NB:2 for iteration:11
NB:2 for iteration:12
NB:2 for iteration:13
NB:2 for iteration:14
NB:2 for iteration:15
NB:2 for iteration:16
NB:2 for iteration:17
NB:2 for iteration:18
NB:2 for iteration:19
NB:2 for iteration:20 <--- MDT COMPACTION
NB:2 for iteration:21
```
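
To put a number on the excess in the run that adds partitions, here is a small sketch in plain Python (no Spark needed) over the counts pasted above. It assumes that a duplicate-free read returns `i + 1` rows after iteration `i`, which is only the pattern seen in iterations 1 to 9 and not something I checked against the MDT contents. The `excess_rows` helper is just for illustration.

```python
# Counts copied from the run that adds a new partition on every iteration.
observed = [2, 3, 4, 5, 6, 7, 8, 9, 10,
            21, 32, 43, 54, 65, 76, 87, 98, 109, 120,
            41, 62]


def excess_rows(counts):
    """Map iteration -> (observed - expected) where the two differ.

    Assumption: the expected count after iteration i is i + 1,
    extrapolated from iterations 1 to 9 of the run.
    """
    return {i: n - (i + 1) for i, n in enumerate(counts, start=1) if n != i + 1}


excess = excess_rows(observed)
# Row-count change between consecutive iterations.
deltas = [b - a for a, b in zip(observed, observed[1:])]

print(excess)
print(deltas)
```

Under that assumption the counts are in line up to iteration 9 and every iteration from 10 onward is over. The per-write increment is 1 before the first compaction, 11 after it, and 21 after the second one.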