[ 
https://issues.apache.org/jira/browse/FLINK-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15650462#comment-15650462
 ] 

Xiaogang Shi commented on FLINK-5036:
-------------------------------------

Current implementation of both checkpointing and restoring requires to iterate 
over all key-value pairs. When the states are very big (up to multiple GBs or 
TBs, which are usual cases in our daily jobs), the performance is obviously 
unacceptable.

Let me explain more details about my proposal. By not organizing kv pairs into 
grouping, we can avoid the iterating and can directly copy the files of RocksDB 
onto HDFS. Of course, we should record all the key groups contained in the 
produced snapshot.

When restoring from snapshots, the master will give each task all the rocksdb 
that contain assigned key groups. Tasks can pick those keys assigned to them by 
accessing these rocksdbs. If the key groups in a rocksdb are all assigned to 
the task, then the task can avoid unnecessary picking. 

In most cases where the degree of parallelism is not changed, fast recovery can 
be achieved because states can be restored by simply copying the files from 
HDFS. In all cases, the performance will be much better than existing 
implementation which needs costly iterating.



> Perform the grouping of keys in restoring instead of checkpointing
> ------------------------------------------------------------------
>
>                 Key: FLINK-5036
>                 URL: https://issues.apache.org/jira/browse/FLINK-5036
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>            Reporter: Xiaogang Shi
>
> Whenever taking snapshots of {{RocksDBKeyedStateBackend}}, the values in the 
> states will be written onto different files according to their key groups. 
> The procedure is very costly when the states are very big. 
> Given that the snapshot operations will be performed much more frequently 
> than restoring, we can leave the key groups as they are to improve the 
> overall performance. In other words, we can perform the grouping of keys in 
> restoring instead of in checkpointing.
> I think, the implementation will be very similar to the restoring of 
> non-partitioned states. Each task will receive a collection of snapshots each 
> of which contains a set of key groups. Each task will restore its states from 
> the given snapshots by picking values in assigned key groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to