Github user arunmahadevan commented on the issue:
https://github.com/apache/spark/pull/21500
Clearing the map after each commit might make things worse, since the map
would then need to be reloaded from the snapshot + delta files for the next
micro-batch. Setting `spark.sql.streaming.minBatchesToRetain` to a lower value
might address the memory consumption to some extent.
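For reference, a sketch of how that conf could be lowered (the value `10` here is just an illustrative choice, not a recommendation; the default is higher):

```
# spark-defaults.conf (or passed via --conf to spark-submit)
# Retain fewer batches of state store snapshots/deltas to reduce memory/storage pressure.
spark.sql.streaming.minBatchesToRetain  10
```

Note this trades off recoverability: fewer retained batches means less history available for restarting from older checkpoints.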
Maybe we need to explore how to avoid maintaining multiple copies of the
state in memory within the HDFS state store, or even explore RocksDB for
incremental checkpointing.