[
https://issues.apache.org/jira/browse/FLINK-19013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17253871#comment-17253871
]
weiyunqing commented on FLINK-19013:
------------------------------------
[~wangm92]
We did some internal research on this issue and found several causes:
1. When incremental RocksDB checkpoints are used to store the state data and
the parallelism is changed, the bottleneck of the recovery time is CPU and
memory. (Flink does not provide a separate RocksDB configuration for recovery;
we have improved this internally, so we can tune the RocksDB configuration for
recovery without affecting the runtime.) In addition, during recovery, other
tasks in the same slot may already have been restored and be running, and
they take part of the process's CPU away from the recovering task. Our
optimizations in this area reduced the recovery time by about 30%.
2. When the JobManager, file system, or RocksDB state backend is used, the task
has to read the full state data during recovery. If the data is not compressed,
the recovery time may be long, so we do not recommend this.
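To see how much of the recovery time is actually spent in the restore call, the
start and end can be logged together with the elapsed time, as this issue
proposes. The following is only a minimal sketch: RestoreTimer and timed are
hypothetical names and not part of Flink, and the Callable stands in for
whatever restore call is being measured.

```java
import java.util.concurrent.Callable;
import java.util.logging.Logger;

// Hypothetical helper for illustration only; not part of Flink.
final class RestoreTimer {
    private static final Logger LOG = Logger.getLogger(RestoreTimer.class.getName());

    private RestoreTimer() {}

    // Runs the given restore action, logging when it starts and when it ends
    // (including the elapsed wall-clock time), then returns its result.
    static <T> T timed(String description, Callable<T> restore) throws Exception {
        LOG.info("Starting to restore " + description);
        long startNanos = System.nanoTime();
        boolean succeeded = false;
        try {
            T result = restore.call();
            succeeded = true;
            return result;
        } finally {
            // The end is logged on both success and failure, so a slow or
            // failed restore is still visible in the log.
            long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
            LOG.info((succeeded ? "Finished" : "Failed")
                    + " restoring " + description + " after " + elapsedMs + " ms");
        }
    }
}
```

In a real job, the restore call would be passed in as the Callable. The log
timestamps then also show which tasks in the same slot were restoring at the
same time.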
I hope this helps.
> Log start/end of state restoration
> ----------------------------------
>
> Key: FLINK-19013
> URL: https://issues.apache.org/jira/browse/FLINK-19013
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / State Backends
> Reporter: Chesnay Schepler
> Assignee: Yun Tang
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.13.0, 1.12.1
>
>
> State restoration can take a significant amount of time if the state is large
> enough, or in special cases like FLINK-19008.
> It would be useful for debugging if we'd log the start/end of
> {{RestoreOperation#restore.}}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)