[ 
https://issues.apache.org/jira/browse/FLINK-19013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253871#comment-17253871
 ] 

weiyunqing commented on FLINK-19013:
------------------------------------

[~wangm92]
We have done some internal research on this issue and looked into several
causes of slow recovery:

1. When incremental RocksDB checkpoints are used to save the state and the
parallelism is changed, the bottleneck for recovery time is CPU and memory.
Flink does not provide a separate RocksDB configuration for recovery; we have
improved this internally, so that we can tune RocksDB for recovery without
affecting the runtime configuration. In addition, during recovery there may be
other tasks in the same slot that have already been restored and are running,
and these tasks take part of the process's CPU. With our optimizations in this
area, we save about 30% of the recovery time.



2. When the job manager, file system or RocksDB state backend is used, the task
needs to read the full state data for recovery. Doing so with uncompressed
state is not recommended, because the recovery time may be long.



I hope this helps.

> Log start/end of state restoration
> ----------------------------------
>
>                 Key: FLINK-19013
>                 URL: https://issues.apache.org/jira/browse/FLINK-19013
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>            Reporter: Chesnay Schepler
>            Assignee: Yun Tang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0, 1.12.1
>
>
> State restoration can take a significant amount of time if the state is large 
> enough, or in special cases like FLINK-19008.
> It would be useful for debugging if we'd log the start/end of 
> {{RestoreOperation#restore}}.
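
For illustration only, the kind of start/end logging the issue asks for could
be sketched as below. This is not Flink code: the class RestoreTiming, the
method timed, and the log message wording are hypothetical names chosen for
this sketch, and a plain Callable stands in for the restore operation so that
the example runs with only the JDK.

```java
import java.util.concurrent.Callable;
import java.util.function.Consumer;

/** Hypothetical helper (not part of Flink): logs start/end of a restore call. */
public final class RestoreTiming {

    private RestoreTiming() {}

    /**
     * Runs the given restore action, emitting one log line before it starts
     * and one after it ends (also on failure), including the elapsed time.
     */
    public static <T> T timed(String name, Callable<T> restore, Consumer<String> log)
            throws Exception {
        log.accept("Starting to restore " + name);
        long start = System.nanoTime();
        try {
            return restore.call();
        } finally {
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            log.accept("Finished restoring " + name + " in " + elapsedMs + " ms");
        }
    }

    public static void main(String[] args) throws Exception {
        // A stand-in for a real restore operation; here it just returns a string.
        String result = timed("keyed state backend", () -> "restored", System.out::println);
        System.out.println(result);
    }
}
```

Putting the end message in a finally block means a failed restore still shows
up in the log with its duration, which is usually what is wanted when
debugging slow or stuck recoveries.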



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
