GitHub user tcondie opened a pull request:

    https://github.com/apache/spark/pull/16219

    [SPARK-18790][SS] Keep a general offset history of stream batches

    ## What changes were proposed in this pull request?
    
    Instead of keeping only the minimum number of offsets, we should keep 
enough information to roll back n batches and re-execute the stream from a 
given point. In particular, we should create a config in SQLConf, 
spark.sql.streaming.retainedBatches, that defaults to 100, and ensure that we 
keep enough log files in the following places to roll back the specified 
number of batches:
    - the offsets that are present in each batch
    - versions of the state store
    - the file lists stored for the FileStreamSource
    - the metadata log stored by the FileStreamSink
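
    The retention rule described above can be illustrated with a minimal, 
self-contained sketch. It is not the code in this patch: the `OffsetLog` class 
and the `retain_recent` helper are hypothetical names, and the sketch assumes 
that purging removes every entry whose batch id is strictly below a threshold.

```python
class OffsetLog:
    """Toy in-memory stand-in for a per-batch offset log (hypothetical)."""

    def __init__(self):
        self._entries = {}

    def add(self, batch_id, offsets):
        self._entries[batch_id] = offsets

    def purge(self, threshold_batch_id):
        """Drop every entry whose batch id is strictly below the threshold."""
        for batch_id in [b for b in self._entries if b < threshold_batch_id]:
            del self._entries[batch_id]

    def batch_ids(self):
        return sorted(self._entries)


def retain_recent(log, current_batch_id, min_batches_to_retain):
    """Keep the current batch plus the previous min_batches_to_retain batches."""
    threshold = current_batch_id - min_batches_to_retain
    if threshold > 0:
        log.purge(threshold)


log = OffsetLog()
for batch_id in range(11):  # batches 0..10
    log.add(batch_id, {"source-0": batch_id * 100})

retain_recent(log, current_batch_id=10, min_batches_to_retain=3)
print(log.batch_ids())
```

    With these assumptions, batches older than the current batch id minus the 
retention count are dropped, so a query could still be rolled back by up to 
that many batches.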
    
    @marmbrus @zsxwing 
    
    ## How was this patch tested?
    
    The following tests were added.
    
    ### StreamExecution offset metadata
    Test added to StreamingQuerySuite that ensures offset metadata is garbage 
collected according to minBatchesToRetain.
    
    ### CompactibleFileStreamLog
    Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged 
starting before the first compaction file that precedes the current batch id - 
minBatchesToRetain.
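
    As a rough illustration of the purge rule for a compacted log, the sketch 
below finds the compaction batch below which older files could be deleted. It 
is an assumption-laden sketch, not the patch itself: `purge_threshold` is a 
hypothetical helper, a batch is assumed to be a compaction batch when 
`(batch_id + 1) % compact_interval == 0`, and "precedes" is read here as "at 
or before".

```python
def is_compaction_batch(batch_id, compact_interval):
    # Assumed convention: every compact_interval-th batch is a compaction batch.
    return (batch_id + 1) % compact_interval == 0


def purge_threshold(current_batch_id, min_batches_to_retain, compact_interval):
    """Return the latest compaction batch id at or before
    current_batch_id - min_batches_to_retain, or None if there is none.

    Files with ids strictly below the returned batch id are candidates for
    deletion, since the compaction file summarizes them.
    """
    candidate = current_batch_id - min_batches_to_retain
    while candidate >= 0:
        if is_compaction_batch(candidate, compact_interval):
            return candidate
        candidate -= 1
    return None


print(purge_threshold(current_batch_id=25, min_batches_to_retain=3, compact_interval=10))
print(purge_threshold(current_batch_id=5, min_batches_to_retain=3, compact_interval=10))
```

    Anchoring the purge at a compaction file rather than at an arbitrary batch 
id means the retained window always starts from a file that contains the full 
history up to that point.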
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tcondie/spark offset_hist

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16219.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16219
    
----
commit 2d96af3ad4fe3d07cd80c727d2527de1d0ba3c57
Author: Tyson Condie <[email protected]>
Date:   2016-12-08T19:02:33Z

    revised log history maintenence based on minBatchesToRetain configuration 
parameter

commit fc1557eb178d070814776ffaa6c14a8cb48ea83a
Author: Tyson Condie <[email protected]>
Date:   2016-12-08T20:42:57Z

    add test for metadata garbage collection based on minBatchesToRetain

----


