huangtengfei created SPARK-24351:
------------------------------------

             Summary: offsetLog/commitLog purge thresholdBatchId should be computed with current committed epoch but not currentBatchId in CP mode
                 Key: SPARK-24351
                 URL: https://issues.apache.org/jira/browse/SPARK-24351
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 2.3.0
            Reporter: huangtengfei
In Structured Streaming, the configuration spark.sql.streaming.minBatchesToRetain specifies "the minimum number of batches that must be retained and made recoverable", as described in [SQLConf](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala#L802). In continuous processing, the metadata purge is triggered when an epoch is committed in [ContinuousExecution](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/continuous/ContinuousExecution.scala#L306). Since currentBatchId advances independently of commits in continuous processing mode, the latest committed epoch may fall far behind currentBatchId if some task hangs for a while. It is therefore not safe to discard metadata using a thresholdBatchId computed from currentBatchId: the purge may delete all of the metadata in the checkpoint directory, leaving the query unrecoverable.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
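A minimal sketch of the proposed fix (hypothetical helper name; the real purge call lives in ContinuousExecution's commit path): compute the threshold from the committed epoch rather than from currentBatchId, so that at least minBatchesToRetain committed batches always survive in the offset/commit logs.

```scala
// Hypothetical sketch, not the actual Spark implementation.
// Entries with batch id < threshold are purged from offsetLog/commitLog,
// so the threshold must be anchored to the committed epoch: anchoring it
// to currentBatchId (which keeps advancing while commits lag) could purge
// epochs that were never committed, or even the whole checkpoint.
def purgeThreshold(committedEpoch: Long, minBatchesToRetain: Int): Long = {
  // Retain committedEpoch and the (minBatchesToRetain - 1) epochs before
  // it; clamp at 0 so a young query with few commits purges nothing.
  math.max(committedEpoch - minBatchesToRetain + 1, 0L)
}
```

With minBatchesToRetain = 10 and committedEpoch = 100, the threshold is 91, so epochs 91..100 are kept even if currentBatchId has already raced ahead to, say, 150.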