[
https://issues.apache.org/jira/browse/SPARK-24717?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529962#comment-16529962
]
Jungtaek Lim commented on SPARK-24717:
--------------------------------------
I have a patch for this, but now on top of SPARK-24441. I'll try to decouple
and rebase to master branch, and provide a pull request.
> Split out min retain version of state for memory in
> HDFSBackedStateStoreProvider
> --------------------------------------------------------------------------------
>
> Key: SPARK-24717
> URL: https://issues.apache.org/jira/browse/SPARK-24717
> Project: Spark
> Issue Type: Bug
> Components: Structured Streaming
> Affects Versions: 2.4.0
> Reporter: Jungtaek Lim
> Priority: Major
>
> HDFSBackedStateStoreProvider has only one configuration for minimum versions
> to retain of state which applies to both memory cache and files. As default
> version of "spark.sql.streaming.minBatchesToRetain" is set to high (100),
> which doesn't require strictly 100x of memory, but I'm seeing 10x ~ 80x of
> memory consumption for various workloads. In addition, in some cases,
> requiring 2x of memory is even unacceptable, so we should split out
> configuration for memory and let users adjust to trade-off memory usage vs
> cache miss.
> In normal case, default value '2' would cover both cases: success and
> restoring failure with less than or around 2x of memory usage, and '1' would
> only cover success case but no longer require more than 1x of memory. In
> extreme case, user can set the value to '0' to completely disable the map
> cache to maximize executor memory.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]