[ https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dan Dutrow updated SPARK-9947:
------------------------------
    Description: 
Problem: When you update an application that has checkpointing enabled (to 
support updateStateByKey and 24/7 operation), you may want to keep the state 
data between restarts but delete the metadata that contains the execution 
state. 

If checkpoint data exists between code redeployments, the program may not 
execute properly or at all. My current workaround is to wrap updateStateByKey 
with my own function that persists the state after every update to a separate 
directory of my own. (That allows me to delete the checkpoint, along with its 
metadata, before redeploying.) When I restart the application, I initialize 
the state with this persisted data. This adds overhead because the same data 
is persisted twice: once in the checkpoint and once in my persisted data 
folder. 

If Kafka Direct API offsets could also be stored in a separate checkpoint 
directory, that would help avoid having to delete them between code 
redeployments as well.

  was:
Problem: When updating an application that has checkpointing enabled to support 
the updateStateByKey functionality, you encounter the problem where you might 
like to maintain state data between restarts but delete the metadata containing 
execution state. 

If checkpoint data exists between code redeployment, the program may not 
execute properly or at all. My current workaround for this issue is to wrap 
updateStateByKey with my own function that persists the state after every 
update to my own separate directory. (That allows me to delete the checkpoint 
with its metadata before redeploying) Then, when I restart the application, I 
initialize the state with this persisted data. This incurs additional overhead 
due to persisting of the same data twice: once in the checkpoint and once in my 
persisted data folder. 

If Kafka Direct API offsets could be stored in another separate checkpoint 
directory, that would help address the problem of having to blow that away 
between code redeployment as well.
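
The workaround described above can be sketched without Spark to show the 
moving parts. This is only an illustration in plain Python, not the reporter's 
code and not a Spark API: the names load_state, save_state, 
update_state_by_key, persisting_update and the state.json file are all 
hypothetical, and the "checkpoint" directory merely stands in for Spark's 
checkpoint (state plus metadata). In a real job the restored data would 
presumably be supplied as the initial state of updateStateByKey.

```python
import json
import os
import shutil
import tempfile


def load_state(state_dir):
    """Return the state persisted in state_dir, or an empty dict if none exists."""
    path = os.path.join(state_dir, "state.json")
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


def save_state(state_dir, state):
    """Persist state to the separate directory; write-then-rename avoids partial files."""
    os.makedirs(state_dir, exist_ok=True)
    tmp_path = os.path.join(state_dir, "state.json.tmp")
    with open(tmp_path, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, os.path.join(state_dir, "state.json"))


def update_state_by_key(batch, state, update_func):
    """Loosely mimic updateStateByKey: call update_func(new_values, old_state) per key."""
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)
    new_state = {}
    for key in set(state) | set(grouped):
        result = update_func(grouped.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key
            new_state[key] = result
    return new_state


def persisting_update(batch, state, update_func, state_dir):
    """The wrapper: update the state, then persist it after every update."""
    new_state = update_state_by_key(batch, state, update_func)
    save_state(state_dir, new_state)
    return new_state


def update_count(new_values, previous):
    """Example update function: a running sum per key."""
    return (previous or 0) + sum(new_values)


def demo():
    with tempfile.TemporaryDirectory() as root:
        checkpoint_dir = os.path.join(root, "checkpoint")  # stand-in for the Spark checkpoint
        state_dir = os.path.join(root, "state")            # separately persisted state
        os.makedirs(checkpoint_dir)

        # First deployment: process two batches.
        state = load_state(state_dir)
        state = persisting_update([("a", 1), ("b", 2)], state, update_count, state_dir)
        state = persisting_update([("a", 3)], state, update_count, state_dir)

        # Redeployment: the checkpoint (with its metadata) is deleted.
        shutil.rmtree(checkpoint_dir)

        # Restart: initialize from the persisted data and keep processing.
        restored = load_state(state_dir)
        return persisting_update([("b", 5)], restored, update_count, state_dir)


if __name__ == "__main__":
    print(demo())
```

The sketch also makes the cost visible: every update writes the full state a 
second time, on top of whatever the checkpoint itself stores.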


> Separate Metadata and State Checkpoint Data
> -------------------------------------------
>
>                 Key: SPARK-9947
>                 URL: https://issues.apache.org/jira/browse/SPARK-9947
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.4.1
>            Reporter: Dan Dutrow
>   Original Estimate: 168h
>  Remaining Estimate: 168h


