[jira] [Commented] (FLINK-8360) Implement task-local state recovery

ASF GitHub Bot (JIRA) Tue, 09 Jan 2018 11:12:24 -0800

    [ 
https://issues.apache.org/jira/browse/FLINK-8360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318974#comment-16318974
 ]


ASF GitHub Bot commented on FLINK-8360:
---------------------------------------

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/5239
  
    I did a peer review and walk through the code.
    Overall, the design look good, +1 for that!
    
    Some comments:
      - I would argue to change the way that these recovery options are 
configured. Currently, this goes through methods on one state backend objects, 
i.e., *configuration in code*. Because that recovery aspect is a really an 
"ops-related" aspect of running a Flink job (or a broader streaming platform), 
it should not be configured in code, but in the config. I found it helpful to 
thing that settings in code are for what concerns the application developers, 
settings in the config for what concerns the people that run Flink. They may be 
the same person in the end, but even then it is helpful because they are 
frequently are in different stages of the application development and 
deployment. Configurations are more easy to "standardize on", like "we want all 
applications in that group to enable local recovery".
    
      - One thing I am not yet 100% sure of is how this will interact in the 
future with RocksDB's optimized local recovery. I assume that checkpoints will 
in the future always use incremental snapshots. For such, there is no stream of 
bytes to store locally in addition. The files are already local and immutable. 
Here, the RocksDB snapshot should probably directly go through the local 
recovery directory, and the diff files would be persisted from there (the 
complete snapshot, which consists only of hardlinks to the files that are also 
in the work directory, would be retained though). Is the assumption that this 
is a "retain data structure" style mechanism, bespoke for each state backend, 
similar as retaining the heap copy-on-write table for the Heap State Backend?
    
    Now, since this PR is already complicated and needs a heavy rebase, I would 
be okay with doing that in another PR, if there is commitment to do this soon 
(before the 1.5 release branch is cut).
    
    Slightly off topic: This code has a very distinct style of using many 
`@Nonnull` annotations. Other newer parts of the code (the once that use 
annotations) follow the contract "non-null unless annotated with `@Nullable`". 
I don't ask to change this. Would be good to actually have a discussion and 
come up with a recommended style to agree on for the future.



> Implement task-local state recovery
> -----------------------------------
>
>                 Key: FLINK-8360
>                 URL: https://issues.apache.org/jira/browse/FLINK-8360
>             Project: Flink
>          Issue Type: New Feature
>          Components: State Backends, Checkpointing
>            Reporter: Stefan Richter
>            Assignee: Stefan Richter
>             Fix For: 1.5.0
>
>
> This issue tracks the development of recovery from task-local state. The main 
> idea is to have a secondary, local copy of the checkpointed state, while 
> there is still a primary copy in DFS that we report to the checkpoint 
> coordinator.
> Recovery can attempt to restore from the secondary local copy, if available, 
> to save network bandwidth. This requires that the assignment from tasks to 
> slots is as sticky is possible.
> For starters, we will implement this feature for all managed keyed states and 
> can easily enhance it to all other state types (e.g. operator state) later.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (FLINK-8360) Implement task-local state recovery

Reply via email to