Github user arunmahadevan commented on the issue:

    https://github.com/apache/spark/pull/21500
  
    Clearing the map after each commit might make things worse, since the map 
would need to be reloaded from the snapshot + delta files for the next 
micro-batch. Setting `spark.sql.streaming.minBatchesToRetain` to a lower value 
might address the memory consumption to some extent. 
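    For reference, a minimal sketch of lowering that setting (the default is 
100; the value 10 and the app name below are just illustrative, not a 
recommendation):
    
    ```scala
    import org.apache.spark.sql.SparkSession
    
    // Sketch only: retain fewer batches of committed state store files,
    // so fewer snapshot/delta versions are kept around.
    val spark = SparkSession.builder()
      .appName("stateful-query") // hypothetical app name
      .config("spark.sql.streaming.minBatchesToRetain", "10")
      .getOrCreate()
    ```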
    
    Maybe we need to explore how to avoid maintaining multiple copies of the 
state in memory within the HDFS state store, or even explore RocksDB for 
incremental checkpointing.


---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]