[ 
https://issues.apache.org/jira/browse/FLINK-5036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15650460#comment-15650460
 ] 

Stefan Richter commented on FLINK-5036:
---------------------------------------

>From the description it is not entirely clear if you are discussing a 
>theoretical or a concrete problem that you observed. If there is an observable 
>performance problem, could you please provide some backing measurements (log 
>outputs), the Flink version, and information about your keys and values to 
>help us figure out the cause and extend of the problem?

If we are talking about a theoretical problem, the way you describe the effect 
of key-groups on snapshotting in the {{RocksDBKeyedStateBackend}} is not 
accurate. What was actually changed for key-groups in the current master is 
that each key in RocksDB is now prefixed by it's corresponding key-group ID 
(1-2 byte, depending on maxParallelism). This allows us to iterate all 
key-value pairs in key-grouped order at snapshot time, similar to how we 
previously iterated them in key-order. Furthermore, all key-groups go to the 
same file and are written consecutively without random IOs, and a small index 
of key-group-id -> offset is created. From this point of view, the performance 
impact of key-groups on snapshot time should (hopefully) be marginal.

As another remark, non-partitioned state is snapshotted/restored in a similar 
way. Avoiding key-groups at snapshot times has also more problematic 
implications for restores than you consider in the description. For example, if 
the new parallelism is not a multiple of the old parallelism, we effectively 
have to read and filter the completed keyed state for each task that works on 
it.



> Perform the grouping of keys in restoring instead of checkpointing
> ------------------------------------------------------------------
>
>                 Key: FLINK-5036
>                 URL: https://issues.apache.org/jira/browse/FLINK-5036
>             Project: Flink
>          Issue Type: Bug
>          Components: State Backends, Checkpointing
>            Reporter: Xiaogang Shi
>
> Whenever taking snapshots of {{RocksDBKeyedStateBackend}}, the values in the 
> states will be written onto different files according to their key groups. 
> The procedure is very costly when the states are very big. 
> Given that the snapshot operations will be performed much more frequently 
> than restoring, we can leave the key groups as they are to improve the 
> overall performance. In other words, we can perform the grouping of keys in 
> restoring instead of in checkpointing.
> I think, the implementation will be very similar to the restoring of 
> non-partitioned states. Each task will receive a collection of snapshots each 
> of which contains a set of key groups. Each task will restore its states from 
> the given snapshots by picking values in assigned key groups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to