GitHub user tcondie opened a pull request:
https://github.com/apache/spark/pull/16219
[SPARK-18790][SS] Keep a general offset history of stream batches
## What changes were proposed in this pull request?
Instead of only keeping the minimum number of offsets around, we should
keep enough information to allow us to roll back n batches and reexecute the
stream starting from a given point. In particular, we should create a config in
SQLConf, spark.sql.streaming.retainedBatches, that defaults to 100, and ensure
that we keep enough log files in the following places to roll back the
specified number of batches:
- the offsets that are present in each batch
- versions of the state store
- the file lists stored for the FileStreamSource
- the metadata log stored by the FileStreamSink
@marmbrus @zsxwing
## How was this patch tested?
The following tests were added.
### StreamExecution offset metadata
Test added to StreamingQuerySuite that ensures offset metadata is garbage
collected according to minBatchesToRetain.
### CompactibleFileStreamLog
Tests added in CompactibleFileStreamLogSuite to ensure that logs are purged
starting before the first compaction file that precedes the current batch id -
minBatchesToRetain.
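
As a rough sketch of the purge boundary tested here: the function names below are hypothetical, and two things are assumed rather than taken from this PR: that a batch is a compaction batch when `(batch_id + 1) % compact_interval == 0`, and that "precedes" means the latest compaction batch at or before `current_batch_id - min_batches_to_retain`.

```python
def is_compaction_batch(batch_id, compact_interval):
    # Assumed convention: every compact_interval-th batch is a compaction batch.
    return (batch_id + 1) % compact_interval == 0


def earliest_file_to_keep(current_batch_id, min_batches_to_retain, compact_interval):
    """Return the smallest batch id whose log file must be kept.

    Files with a smaller batch id are candidates for purging. The walk goes
    back from the rollback target to the nearest compaction batch, because
    that file summarizes everything before it.
    """
    batch_id = current_batch_id - min_batches_to_retain
    while batch_id >= 0 and not is_compaction_batch(batch_id, compact_interval):
        batch_id -= 1
    # No compaction file at or before the target: keep everything.
    return max(batch_id, 0)


# Compaction batches for an interval of 5 are 4, 9, 14, 19, 24, ...
print(earliest_file_to_keep(current_batch_id=25, min_batches_to_retain=10, compact_interval=5))  # 14
```

Here the rollback target is batch 15, so the compaction file for batch 14 and everything after it are kept, and files for batches 0 through 13 can be removed.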
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tcondie/spark offset_hist
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16219.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16219
----
commit 2d96af3ad4fe3d07cd80c727d2527de1d0ba3c57
Author: Tyson Condie <[email protected]>
Date: 2016-12-08T19:02:33Z
revised log history maintenence based on minBatchesToRetain configuration
parameter
commit fc1557eb178d070814776ffaa6c14a8cb48ea83a
Author: Tyson Condie <[email protected]>
Date: 2016-12-08T20:42:57Z
add test for metadata garbage collection based on minBatchesToRetain
----