[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tathagata Das updated SPARK-9947:
---------------------------------
Target Version/s: (was: 1.5.0)
> Separate Metadata and State Checkpoint Data
> -------------------------------------------
>
> Key: SPARK-9947
> URL: https://issues.apache.org/jira/browse/SPARK-9947
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Affects Versions: 1.4.1
> Reporter: Dan Dutrow
> Original Estimate: 168h
> Remaining Estimate: 168h
>
> Problem: When you update an application that has checkpointing enabled (to
> support updateStateByKey and 24/7 operation), you may want to keep the state
> data across restarts but delete the metadata that contains execution state.
> Both currently live in the same checkpoint, so one cannot be deleted without
> the other.
> If checkpoint data from the previous deployment still exists when new code is
> deployed, the program may not execute properly, or at all. My current
> workaround is to wrap updateStateByKey with my own function that persists the
> state to a separate directory after every update. That allows me to delete
> the checkpoint, along with its metadata, before redeploying. When I restart
> the application, I initialize the state from this persisted data. This incurs
> additional overhead because the same data is persisted twice: once in the
> checkpoint and once in my own persisted-data folder.
> If Kafka Direct API offsets could also be stored in a separate checkpoint
> directory, that would avoid losing them as well when the checkpoint is
> deleted between code redeployments.
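
The workaround described above (persist the state to a separate directory after every update, then reload it as the initial state after a redeploy) can be sketched as follows. This is only an illustration of the pattern in plain Python, not the reporter's code and not the Spark API: the names load_state, save_state, update_and_persist and the file name state.json are hypothetical, and the state is assumed to be a small dict of per-key counts rather than a distributed RDD.

```python
import json
import os
import tempfile

STATE_FILE = "state.json"  # hypothetical file name, kept outside the checkpoint


def load_state(state_dir):
    """Return the last persisted state, or an empty dict on the first run."""
    path = os.path.join(state_dir, STATE_FILE)
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_state(state_dir, state):
    """Write to a temp file, then rename it over the previous state file."""
    os.makedirs(state_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=state_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, os.path.join(state_dir, STATE_FILE))


def update_and_persist(state_dir, state, batch):
    """Apply one batch of (key, count) pairs, then persist the new state."""
    new_state = dict(state)
    for key, value in batch:
        new_state[key] = new_state.get(key, 0) + value
    save_state(state_dir, new_state)
    return new_state
```

On restart, the application would call load_state once and use the result as its starting state, so the checkpoint directory itself can be deleted before redeploying. The double-write cost mentioned in the description shows up here as the save_state call on every batch.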