Aditya Goenka created HUDI-7481:
-----------------------------------

             Summary: schemacommit file increases with every commit and 
ultimately failing with OOM
                 Key: HUDI-7481
                 URL: https://issues.apache.org/jira/browse/HUDI-7481
             Project: Apache Hudi
          Issue Type: Bug
          Components: writer-core
            Reporter: Aditya Goenka
             Fix For: 1.1.0


The schemacommit file grows with every commit, even when the schema has not 
changed, because it retains every historical schema version. Eventually the job 
fails with an OOM exception because of this.

Code to reproduce:

```
# Assumes an active SparkSession named `spark` (e.g. the pyspark shell)
# with the Hudi Spark bundle available.
basePath = "file:///tmp/hudi_cow_read"
streamingTableName = "hudi_trips_cow_streaming"
baseStreamingPath = "file:///tmp/hudi_trips_cow_streaming"
checkpointLocation = "file:///tmp/checkpoints/hudi_trips_cow_streaming"

hudi_streaming_options = {
    'hoodie.table.name': streamingTableName,
    'hoodie.datasource.write.recordkey.field': 'uuid',
    'hoodie.datasource.write.precombine.field': 'ts',
    'hoodie.datasource.write.partitionpath.field': 'city',
    'hoodie.datasource.write.table.name': streamingTableName,
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.upsert.shuffle.parallelism': 2,
    'hoodie.insert.shuffle.parallelism': 2,
    'hoodie.schema.on.read.enable': 'true',
    'hoodie.datasource.write.reconcile.schema': 'true',
    'hoodie.datasource.write.drop.partition.columns': 'true',
    'hoodie.datasource.write.hive_style_partitioning': 'true'
}

# create streaming df from the existing Hudi table at basePath
df = spark.readStream \
    .format("hudi") \
    .option("hoodie.datasource.read.incr.fallback.fulltablescan.enable", "true") \
    .load(basePath) \
    .select("ts", "uuid", "rider", "driver", "fare", "city")

# write stream to new hudi table; each 10-second trigger produces a commit
df.writeStream.format("hudi") \
    .options(**hudi_streaming_options) \
    .outputMode("append") \
    .option("path", baseStreamingPath) \
    .option("checkpointLocation", checkpointLocation) \
    .trigger(processingTime='10 seconds') \
    .start() \
    .awaitTermination()
```
GitHub issue: https://github.com/apache/hudi/issues/10816
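
To see the growth while the repro is running, the schemacommit file sizes can be 
listed between commits. This is a minimal sketch, not part of the original report. 
It assumes the schema history is written as `*.schemacommit` files under 
`.hoodie/.schema` in the table base path (an assumption about the on-disk layout) 
and that the table is on the local filesystem, as in the repro. 
`schemacommit_sizes` is a hypothetical helper name.

```python
from pathlib import Path


def schemacommit_sizes(table_path):
    """Return (file name, size in bytes) for each schemacommit file, sorted by name."""
    # Assumption: schema history files live under <table>/.hoodie/.schema
    schema_dir = Path(table_path) / ".hoodie" / ".schema"
    if not schema_dir.is_dir():
        return []
    return sorted(
        (f.name, f.stat().st_size) for f in schema_dir.glob("*.schemacommit")
    )


if __name__ == "__main__":
    # Local form of baseStreamingPath from the repro (file:// prefix dropped).
    for name, size in schemacommit_sizes("/tmp/hudi_trips_cow_streaming"):
        print(f"{name}\t{size} bytes")
```

If the files are named by commit instant, sorting by name also orders them by 
time, so steadily increasing sizes with an unchanged schema would match the 
behaviour described above.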


