[GitHub] spark pull request #21700: SPARK-24717 Split out min retain version of state...

HeartSaVioR Mon, 02 Jul 2018 15:23:43 -0700

GitHub user HeartSaVioR opened a pull request:

    https://github.com/apache/spark/pull/21700


    SPARK-24717 Split out min retain version of state for memory in 
HDFSBackedStateStoreProvider

    ## What changes were proposed in this pull request?
    
    This patch splits the configuration for how many batches of state to retain 
into two parts: files and the in-memory cache. It reuses the existing 
configuration for files and introduces a new configuration, 
"spark.sql.streaming.maxBatchesToRetainInMemory", to set the maximum number of 
batches to retain in memory.
    
    This patch also introduces BoundedSortedMap, which retains at most the first N 
elements (sorted by key) and can be used for loadedMaps in 
HDFSBackedStateStoreProvider.
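
To illustrate the "retain at most the first N elements" behaviour described above, here is a minimal sketch of a size-bounded sorted map. It is not the class from this pull request: the name `BoundedSortedMapSketch`, the constructor signature, and the choice to evict only inside `put` are assumptions made for illustration. Passing a reversed comparator, so that the "first N" keys are the newest versions, is likewise an assumption about how such a map might be used for version numbers.

```java
import java.util.Comparator;
import java.util.TreeMap;

// Hypothetical sketch of a size-bounded sorted map; not the code from the PR.
class BoundedSortedMapSketch<K, V> extends TreeMap<K, V> {
    private static final long serialVersionUID = 1L;

    // Maximum number of entries to keep (assumed parameter name).
    private final int limit;

    BoundedSortedMapSketch(Comparator<? super K> comparator, int limit) {
        super(comparator);
        if (limit < 0) {
            throw new IllegalArgumentException("limit must be non-negative");
        }
        this.limit = limit;
    }

    @Override
    public V put(K key, V value) {
        V previous = super.put(key, value);
        // Drop entries from the tail of the ordering until the bound holds,
        // so only the first `limit` keys (per the comparator) survive.
        // Note: only put() is bounded in this sketch; bulk operations such
        // as putAll() may bypass it.
        while (size() > limit) {
            pollLastEntry();
        }
        return previous;
    }
}
```

With a limit of 2 and `Comparator.reverseOrder()` on `Long` keys, putting versions 1, 2, and 3 in that order leaves only versions 3 and 2 in the map, because each insert beyond the limit evicts the entry that sorts last. Evicting at insert time is what makes a separate cleanup pass unnecessary in this sketch.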
    
    ## How was this patch tested?
    
    Applied this patch on top of SPARK-24441 
(https://github.com/apache/spark/pull/21469) and manually tested across various 
workloads to confirm that the overall size of state is around 2x or less, 
instead of 10x ~ 80x.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HeartSaVioR/spark SPARK-24717

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21700.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21700
    
----
commit 22f0e220f661b5457584ef83b1ecddc18212fa73
Author: Jungtaek Lim <kabhwan@...>
Date:   2018-07-02T22:04:49Z

    SPARK-24717 Split out min retain version of state for memory in 
HDFSBackedStateStoreProvider
    
    * introduce BoundedSortedMap which implements bounded size of sorted map
      * only first N elements will be retained
    * replace loadedMaps to BoundedSortedMap to retain only N versions of states
      * no need to cleanup in maintenance phase
    * introduce new configuration: 
spark.sql.streaming.minBatchesToRetainInMemory

----

