GitHub user HeartSaVioR opened a pull request:
https://github.com/apache/spark/pull/21700
SPARK-24717 Split out min retain version of state for memory in
HDFSBackedStateStoreProvider
## What changes were proposed in this pull request?
This patch splits the configuration for how many batches of state to retain
into two parts: one for files and one for the in-memory cache. It reuses the
existing configuration for files and introduces a new configuration,
"spark.sql.streaming.maxBatchesToRetainInMemory", to set the maximum number
of batches to retain in memory.
This patch also introduces BoundedSortedMap, a sorted map that retains at
most the first N elements (sorted by key), which can be used for loadedMaps
in HDFSBackedStateStoreProvider.
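To illustrate the bounded sorted map idea described above, here is a minimal sketch and not the code in this patch. The class name BoundedSortedMapSketch, its constructor, and its methods are hypothetical; it assumes the bound is enforced by evicting the last key in comparator order whenever an insert pushes the size past N.

```java
import java.util.Comparator;
import java.util.TreeMap;

/**
 * Hypothetical sketch (not the class from this patch): a sorted map that
 * keeps only the first `capacity` keys in comparator order.
 */
class BoundedSortedMapSketch<K, V> {
    private final TreeMap<K, V> entries;
    private final int capacity;

    BoundedSortedMapSketch(Comparator<? super K> comparator, int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.entries = new TreeMap<>(comparator);
        this.capacity = capacity;
    }

    /** Inserts the entry, then evicts the last key if the bound is exceeded. */
    V put(K key, V value) {
        V previous = entries.put(key, value);
        if (entries.size() > capacity) {
            // The evicted entry may be the one just inserted, if its key
            // sorts after every retained key.
            entries.pollLastEntry();
        }
        return previous;
    }

    V get(K key) {
        return entries.get(key);
    }

    boolean containsKey(K key) {
        return entries.containsKey(key);
    }

    int size() {
        return entries.size();
    }

    K firstKey() {
        return entries.firstKey();
    }
}
```

With a reverse-order comparator on version numbers, the "first N" keys are the N most recent versions, so inserting versions 1 through 4 with N = 2 would leave only versions 4 and 3 in the map. Evicting on insert is also what would make a separate cleanup pass unnecessary in this sketch.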
## How was this patch tested?
Applied this patch on top of SPARK-24441
(https://github.com/apache/spark/pull/21469) and manually tested it to
confirm that the overall size of state is around 2x or less, instead of
10x ~ 80x, across various workloads.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HeartSaVioR/spark SPARK-24717
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21700.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21700
----
commit 22f0e220f661b5457584ef83b1ecddc18212fa73
Author: Jungtaek Lim <kabhwan@...>
Date: 2018-07-02T22:04:49Z
SPARK-24717 Split out min retain version of state for memory in
HDFSBackedStateStoreProvider
* introduce BoundedSortedMap which implements bounded size of sorted map
* only first N elements will be retained
* replace loadedMaps to BoundedSortedMap to retain only N versions of states
* no need to cleanup in maintenance phase
* introduce new configuration:
spark.sql.streaming.minBatchesToRetainInMemory
----