[jira] [Updated] (SPARK-9947) Separate Metadata and State Checkpoint Data

Dan Dutrow (JIRA) Thu, 13 Aug 2015 14:03:11 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-9947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dan Dutrow updated SPARK-9947:
------------------------------
                 Flags: Important
     Affects Version/s: 1.4.1
      Target Version/s: 1.5.0  (was: 1.4.0)
    Remaining Estimate: 168h
     Original Estimate: 168h
              Priority: Minor  (was: Major)
           Description: 
Problem: When updating an application that has checkpointing enabled to support 
the updateStateByKey functionality, you encounter the problem where you might 
like to maintain state data between restarts but delete the metadata containing 
execution state. 

If checkpoint data exists between code redeployment, the program may not 
execute properly or at all. My current workaround for this issue is to wrap 
updateStateByKey with my own function that persists the state after every 
update to my own separate directory. (That allows me to delete the checkpoint 
with its metadata before redeploying) Then, when I restart the application, I 
initialize the state with this persisted data. This incurs additional overhead 
due to persisting of the same data twice: once in the checkpoint and once in my 
persisted data folder. 

If Kafka Direct API offsets could be stored in another separate checkpoint 
directory, that would help address the problem of having to blow that away 
between code redeployment as well.

  was:
This is the proposal. 

The simpler direct API (the one that does not take explicit offsets) can be 
modified to also pick up the initial offset from ZK if group.id is specified. 
This is exactly similar to how we find the latest or earliest offset in that 
API, just that instead of latest/earliest offset of the topic we want to find 
the offset from the consumer group. The group offsets is ZK is not used at all 
for any further processing and restarting, so the exactly-once semantics is not 
broken. 

The use case where this is useful is simplified code upgrade. If the user wants 
to upgrade the code, he/she can the context stop gracefully which will ensure 
the ZK consumer group offset will be updated with the last offsets processed. 
Then the new code is started (not restarted from checkpoint) can pickup  the 
consumer group offset from ZK and continue where the previous code had left 
off. 

Without the functionality of picking up consumer group offsets to start (that 
is, currently) the only way to do this is for the users to save the offsets 
somewhere (file, database, etc.) and manage the offsets themselves. I just want 
to simplify this process. 


> Separate Metadata and State Checkpoint Data
> -------------------------------------------
>
>                 Key: SPARK-9947
>                 URL: https://issues.apache.org/jira/browse/SPARK-9947
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>    Affects Versions: 1.4.1
>            Reporter: Dan Dutrow
>            Priority: Minor
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> Problem: When updating an application that has checkpointing enabled to 
> support the updateStateByKey functionality, you encounter the problem where 
> you might like to maintain state data between restarts but delete the 
> metadata containing execution state. 
> If checkpoint data exists between code redeployment, the program may not 
> execute properly or at all. My current workaround for this issue is to wrap 
> updateStateByKey with my own function that persists the state after every 
> update to my own separate directory. (That allows me to delete the checkpoint 
> with its metadata before redeploying) Then, when I restart the application, I 
> initialize the state with this persisted data. This incurs additional 
> overhead due to persisting of the same data twice: once in the checkpoint and 
> once in my persisted data folder. 
> If Kafka Direct API offsets could be stored in another separate checkpoint 
> directory, that would help address the problem of having to blow that away 
> between code redeployment as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-9947) Separate Metadata and State Checkpoint Data

Reply via email to