Jungtaek Lim created SPARK-24717:
------------------------------------

             Summary: Split out min retain version of state for memory in 
HDFSBackedStateStoreProvider
                 Key: SPARK-24717
                 URL: https://issues.apache.org/jira/browse/SPARK-24717
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.4.0
            Reporter: Jungtaek Lim


HDFSBackedStateStoreProvider has only one configuration for minimum versions to 
retain of state which applies to both memory cache and files. As default 
version of "spark.sql.streaming.minBatchesToRetain" is set to high (100), which 
doesn't require strictly 100x of memory, but I'm seeing 10x ~ 80x of memory 
consumption for various workloads. In addition, in some cases, requiring 2x of 
memory is even unacceptable, so we should split out configuration for memory 
and let users adjust to trade-off memory usage vs cache miss.

In normal case, default value '2' would cover both cases: success and restoring 
failure with less than or around 2x of memory usage, and '1' would only cover 
success case but no longer require more than 1x of memory. In extreme case, 
user can set the value to '0' to completely disable the map cache to maximize 
executor memory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to