Jungtaek Lim created SPARK-24717:
------------------------------------
Summary: Split out min retain version of state for memory in
HDFSBackedStateStoreProvider
Key: SPARK-24717
URL: https://issues.apache.org/jira/browse/SPARK-24717
Project: Spark
Issue Type: Bug
Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Jungtaek Lim
HDFSBackedStateStoreProvider has only one configuration for the minimum number
of state versions to retain, and it applies to both the in-memory cache and the
files. The default value of "spark.sql.streaming.minBatchesToRetain" is high
(100). That does not strictly require 100x the memory, but I'm seeing 10x ~ 80x
memory consumption for various workloads. In addition, in some cases even 2x
the memory is unacceptable, so we should split out a separate configuration for
memory and let users trade off memory usage against cache misses.
In the normal case, a default value of '2' would cover both cases, success and
restoring after a failure, with around 2x memory usage or less, while '1' would
cover only the success case but require no more than 1x memory. In the extreme
case, users can set the value to '0' to disable the map cache completely and
maximize the memory left to the executor.
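
To illustrate why '2' covers a retried batch while '1' covers only the success
case, here is a toy simulation. It is a sketch under assumptions, not Spark
code: the names VersionCache, max_versions_in_memory and simulate are
hypothetical, and a "miss" simply stands for falling back to reading state
files.

```python
class VersionCache:
    """Toy in-memory cache of state versions (hypothetical, not Spark code)."""

    def __init__(self, max_versions_in_memory):
        self.max_versions = max_versions_in_memory
        self.loaded = {}        # version -> state map
        self.misses = 0         # lookups that would have to read files
        self.peak_versions = 0  # most versions held in memory at once

    def put(self, version, state):
        if self.max_versions <= 0:
            return  # cache disabled: keep nothing in memory
        self.loaded[version] = state
        # Evict the oldest versions beyond the configured limit.
        for old in sorted(self.loaded)[:-self.max_versions]:
            del self.loaded[old]
        self.peak_versions = max(self.peak_versions, len(self.loaded))

    def get(self, version):
        if version in self.loaded:
            return self.loaded[version]
        self.misses += 1
        return None


def simulate(max_versions_in_memory, batches, retry_at=None):
    """Each batch v loads version v-1 and commits version v.

    If retry_at is set, that batch runs twice, modelling a rerun after
    the task had already committed its version.
    """
    cache = VersionCache(max_versions_in_memory)
    cache.put(0, {})
    for v in range(1, batches + 1):
        attempts = 2 if v == retry_at else 1
        for _ in range(attempts):
            cache.get(v - 1)
            cache.put(v, {"batch": v})
    return cache.misses, cache.peak_versions


if __name__ == "__main__":
    for limit in (2, 1, 0):
        print(limit, simulate(limit, batches=5, retry_at=3))
```

In this model, a limit of 2 serves the rerun from memory, a limit of 1 misses
once on the rerun, and a limit of 0 misses on every load while holding nothing
in memory.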