Aditya Goenka created HUDI-7481:
-----------------------------------
Summary: schemacommit file grows with every commit and
ultimately causes the job to fail with OOM
Key: HUDI-7481
URL: https://issues.apache.org/jira/browse/HUDI-7481
Project: Apache Hudi
Issue Type: Bug
Components: writer-core
Reporter: Aditya Goenka
Fix For: 1.1.0
The schemacommit file grows with every commit, even when the schema does not
change, because it keeps all historical schema versions. Eventually the file
becomes large enough that the job starts failing with an OOM exception.
The following code reproduces the issue:
```
basePath = "file:///tmp/hudi_cow_read"
streamingTableName = "hudi_trips_cow_streaming"
baseStreamingPath = "file:///tmp/hudi_trips_cow_streaming"
checkpointLocation = "file:///tmp/checkpoints/hudi_trips_cow_streaming"
hudi_streaming_options = {
    'hoodie.table.name': streamingTableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.datasource.write.partitionpath.field': 'city',
    'hoodie.datasource.write.table.name': streamingTableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.schema.on.read.enable': 'true',
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.write.drop.partition.columns': 'true',
    'hoodie.datasource.write.hive_style_partitioning': 'true'
}
# create streaming df
df = spark.readStream \
    .format("hudi") \
    .option("hoodie.datasource.read.incr.fallback.fulltablescan.enable", "true") \
    .load(basePath) \
    .select("ts", "uuid", "rider", "driver", "fare", "city")
# write stream to new hudi table
df.writeStream.format("hudi") \
    .options(**hudi_streaming_options) \
    .outputMode("append") \
    .option("path", baseStreamingPath) \
    .option("checkpointLocation", checkpointLocation) \
    .trigger(processingTime='10 seconds') \
    .start() \
    .awaitTermination()
```
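To observe the growth while the stream runs, something like the following could be used. This is only a rough sketch: it assumes the table lives on the local filesystem and that the schema history files are written under `.hoodie/.schema` with a `.schemacommit` suffix. The helper name `schemacommit_sizes` and its `schema_dir` parameter are made up for illustration and are not part of Hudi.

```python
from pathlib import Path


def schemacommit_sizes(table_path, schema_dir=".hoodie/.schema"):
    """Return (file name, size in bytes) for each *.schemacommit file, sorted by name.

    schema_dir is an assumption about where Hudi keeps its schema history;
    adjust it if the table layout differs.
    """
    root = Path(table_path) / schema_dir
    if not root.is_dir():
        # Table not created yet, or schema history is stored elsewhere.
        return []
    return sorted((p.name, p.stat().st_size) for p in root.glob("*.schemacommit"))


if __name__ == "__main__":
    # Local path matching baseStreamingPath from the repro (without the file:// scheme).
    for name, size in schemacommit_sizes("/tmp/hudi_trips_cow_streaming"):
        print(f"{name}\t{size} bytes")
```

Running this after every few triggers should show whether the sizes keep increasing even though the schema stays the same.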
GitHub issue: https://github.com/apache/hudi/issues/10816