[ https://issues.apache.org/jira/browse/FLINK-7873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16266966#comment-16266966 ]
ASF GitHub Bot commented on FLINK-7873:
---------------------------------------
Github user StefanRRichter commented on a diff in the pull request:
https://github.com/apache/flink/pull/5074#discussion_r153235421
--- Diff:
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/savepoint/SavepointV2Serializer.java
---
@@ -510,6 +512,13 @@ private static void serializeStreamStateHandle(
byte[] internalData = byteStreamStateHandle.getData();
dos.writeInt(internalData.length);
dos.write(byteStreamStateHandle.getData());
+ } else if (stateHandle instanceof CachedStreamStateHandle) {
--- End diff --
This means we are actually introducing significant new code into the job
manager, which even impacts the serialization format. I think this should not
strictly be required if we map local state to checkpoint ids.
> Introduce CheckpointCacheManager for reading checkpoint data locally when
> performing failover
> ---------------------------------------------------------------------------------------------
>
> Key: FLINK-7873
> URL: https://issues.apache.org/jira/browse/FLINK-7873
> Project: Flink
> Issue Type: New Feature
> Components: State Backends, Checkpointing
> Affects Versions: 1.3.2
> Reporter: Sihua Zhou
> Assignee: Sihua Zhou
>
> Why I'm introducing this:
> The current recovery strategy always reads checkpoint data from a remote
> FileStream (HDFS). This costs a lot of bandwidth when the state is large
> (e.g. 1 TB). What's worse, if a job recovers again and again, it can eat up
> all network bandwidth and badly hurt the cluster. So I propose that we cache
> the checkpoint data locally and read checkpoint data from the local cache
> whenever we can, reading from remote only if the local read fails. The
> advantage is that if an execution is assigned to the same TaskManager as
> before, it saves a lot of bandwidth and recovers faster.
> Solution:
> The TaskManager does the caching and manages the cached data itself. It
> simply uses a TTL-like policy to dispose of cache entries: an entry is
> disposed if it has not been touched for some time X, and touching an entry
> resets its TTL. This way, all the work is done by the TaskManager and is
> transparent to the JobManager. The only problem is that we may dispose of an
> entry that is still useful; in that case, we have to read the data from
> remote after all, but users can avoid this by setting a proper TTL value
> according to the checkpoint interval and other factors.
> Can someone give me some advice? I would appreciate it very much~
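The TTL-based eviction described in the Solution paragraph could be sketched roughly as below. This is only an illustrative sketch, not the actual FLINK-7873 implementation: the class and method names (`CheckpointDataCache`, `evictExpired`, using a long handle id as the cache key) are hypothetical, and time is passed in explicitly instead of using the system clock so the behavior is easy to test.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of a TaskManager-local checkpoint-data cache with
// TTL-style eviction: an entry is disposed if it has not been touched for
// longer than the TTL, and every successful read resets its TTL.
public class CheckpointDataCache {

    private static final class CacheEntry {
        final byte[] data;
        volatile long lastAccessMillis;

        CacheEntry(byte[] data, long nowMillis) {
            this.data = data;
            this.lastAccessMillis = nowMillis;
        }
    }

    private final Map<Long, CacheEntry> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public CheckpointDataCache(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    public void put(long handleId, byte[] data, long nowMillis) {
        entries.put(handleId, new CacheEntry(data, nowMillis));
    }

    /**
     * Returns the cached bytes and resets the entry's TTL, or null on a
     * cache miss. On a miss the caller falls back to reading from the
     * remote FileStream (HDFS), as described in the issue.
     */
    public byte[] get(long handleId, long nowMillis) {
        CacheEntry e = entries.get(handleId);
        if (e == null) {
            return null; // miss: fall back to remote checkpoint data
        }
        e.lastAccessMillis = nowMillis; // touching an entry resets its TTL
        return e.data;
    }

    /** Disposes all entries not touched for longer than the TTL. */
    public void evictExpired(long nowMillis) {
        entries.entrySet().removeIf(
            en -> nowMillis - en.getValue().lastAccessMillis > ttlMillis);
    }

    public int size() {
        return entries.size();
    }
}
```

As the issue notes, eviction purely by TTL can dispose of an entry that is still needed; choosing a TTL somewhat larger than the checkpoint interval reduces how often a recovery has to fall back to the remote read.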
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)