Hi Jinzhong,

Yes, I know the cleanup mechanism for the remote working directory is the same 
as the current RocksDB state backend's. However, residual files in the remote 
working directory have a different impact than residual files in the local 
directory, especially since Flink only cleans up on a best-effort basis during 
stateBackend#dispose.

I agree that we could leave the optimization to a future FLIP; however, I 
think we should mention this topic in the current FLIP to make the overall 
design more complete.


Best
Yun Tang
________________________________
From: Jinzhong Li <lijinzhong2...@gmail.com>
Sent: Thursday, March 28, 2024 12:45
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration for 
Disaggregated State

Hi Feifan,

Sorry for the misunderstanding. As Hangxiang explained, the basic cleanup
mechanism for the remote working directory is the same as the
rocksdb-statebackend's: when the TM exits, the forst-statebackend will delete
the entire working dir. As for cleaning up orphaned files when a TM crashes,
we will address that in a future FLIP.
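
For illustration, a best-effort version of that cleanup could look like the
sketch below (the class and logger are made up for this example; only the
Flink FileSystem calls are real API):

    import org.apache.flink.core.fs.FileSystem;
    import org.apache.flink.core.fs.Path;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Hypothetical helper, not actual ForSt code: best-effort deletion of
    // the per-TM remote working dir when the state backend is disposed.
    class RemoteWorkingDirCleaner {
        private static final Logger LOG =
                LoggerFactory.getLogger(RemoteWorkingDirCleaner.class);

        static void cleanup(Path remoteWorkingDir) {
            try {
                FileSystem fs = remoteWorkingDir.getFileSystem();
                // Recursive delete of the entire working directory.
                fs.delete(remoteWorkingDir, true);
            } catch (Exception e) {
                // Best-effort only: if the TM crashes before this runs, the
                // files stay orphaned until a future FLIP adds a real fix.
                LOG.warn("Could not clean remote working dir {}",
                        remoteWorkingDir, e);
            }
        }
    }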

Best,
Jinzhong

On Thu, Mar 28, 2024 at 12:35 PM Hangxiang Yu <master...@gmail.com> wrote:

> Hi, Yun and Feifan.
>
> Thanks for your reply.
>
> About the cleanup of the working dir, as mentioned in FLIP-427, "The life
> cycle of working dir is managed as before local strategy."
> Since the current working dir and the checkpoint dir are separate, the life
> cycle of the working dir, including creation and cleanup, can easily be
> kept aligned with the previous behavior.
>
> On Thu, Mar 28, 2024 at 12:07 PM Feifan Wang <zoltar9...@163.com> wrote:
>
> > And I think the cleanup of the working dir should be discussed in
> > FLIP-427[1] (this mail thread [2])?
> >
> >
> > [1] https://cwiki.apache.org/confluence/x/T4p3EQ
> > [2] https://lists.apache.org/thread/vktfzqvb7t4rltg7fdlsyd9sfdmrc4ft
> >
> > ——————————————
> >
> > Best regards,
> >
> > Feifan Wang
> >
> >
> >
> >
> > At 2024-03-28 11:56:22, "Feifan Wang" <zoltar9...@163.com> wrote:
> > >Hi Jinzhong :
> > >
> > >
> > >> I suggest that we could postpone this topic for now and consider it
> > >> holistically together with TM ownership file management in a future
> > >> FLIP.
> > >
> > >
> > >Sorry, I still think we should consider the cleanup of the working dir in
> > >this FLIP. Although we may come up with a better solution in a subsequent
> > >FLIP, I think it is important to keep the current changes complete;
> > >otherwise we may suffer from wasted DFS space for some time.
> > >Perhaps we only need a simple cleanup strategy at this stage, such as
> > >proactive cleanup when the TM exits. While this may fail if the TM
> > >crashes, it already alleviates the problem.
> > >
> > >
> > >
> > >
> > >——————————————
> > >
> > >Best regards,
> > >
> > >Feifan Wang
> > >
> > >
> > >
> > >
> > >At 2024-03-28 11:15:11, "Jinzhong Li" <lijinzhong2...@gmail.com> wrote:
> > >>Hi Yun,
> > >>
> > >>Thanks for your reply.
> > >>
> > >>> 1. Why must we have another 'subTask-checkpoint-sub-dir'
> > >>> under the shared directory? If we don't consider introducing
> > >>> TM ownership in this FLIP, this design seems unnecessary.
> > >>
> > >> Good catch! We will not change the directory layout of the shared
> > >>directory in this FLIP. I have already removed this part from the FLIP.
> > >>I think we could revisit this topic in a future FLIP about TM ownership.
> > >>
> > >>> 2. This FLIP forgets to mention the cleanup of the remote
> > >>> working directory in case the taskmanager crashes;
> > >>> even though this is an open problem, we can still leave
> > >>> some space for future optimization.
> > >>
> > >>Considering that we have plans to merge the TM working dir and the
> > >>checkpoint dir into one directory, I suggest that we could postpone this
> > >>topic for now and consider it holistically together with TM ownership
> > >>file management in a future FLIP.
> > >>
> > >>Best,
> > >>Jinzhong
> > >>
> > >>
> > >>
> > >>On Wed, Mar 27, 2024 at 11:49 PM Yun Tang <myas...@live.com> wrote:
> > >>
> > >>> Hi Jinzhong,
> > >>>
> > >>> The overall design looks good.
> > >>>
> > >>> I have two minor questions:
> > >>>
> > >>> 1. Why must we have another 'subTask-checkpoint-sub-dir' under the
> > >>> shared directory? If we don't consider introducing TM ownership in
> > >>> this FLIP, this design seems unnecessary.
> > >>> 2. This FLIP forgets to mention the cleanup of the remote working
> > >>> directory in case the taskmanager crashes; even though this is an open
> > >>> problem, we can still leave some space for future optimization.
> > >>>
> > >>> Best,
> > >>> Yun Tang
> > >>>
> > >>> ________________________________
> > >>> From: Jinzhong Li <lijinzhong2...@gmail.com>
> > >>> Sent: Monday, March 25, 2024 10:41
> > >>> To: dev@flink.apache.org <dev@flink.apache.org>
> > >>> Subject: Re: [DISCUSS] FLIP-428: Fault Tolerance/Rescale Integration
> > for
> > >>> Disaggregated State
> > >>>
> > >>> Hi Yue,
> > >>>
> > >>> Thanks for your comments.
> > >>>
> > >>> CURRENT is a special file that points to the latest manifest log
> > >>> file. As Zakelly explained above, we can record the latest manifest
> > >>> filename during the sync phase and write it into the CURRENT snapshot
> > >>> file during the async phase.
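> > >>>
> > >>> To make that concrete, here is a rough sketch (the helper and the
> > >>> paths are illustrative, not the actual ForSt code; only the Flink
> > >>> FileSystem calls are real API):
> > >>>
> > >>>   // Sync phase (DB paused): pin the manifest name CURRENT points to
> > >>>   // right now, e.g. "MANIFEST-000008".
> > >>>   String pinnedManifest = readCurrentFile(dbPath);  // assumed helper
> > >>>
> > >>>   // Async phase: write the pinned name into the snapshot's CURRENT
> > >>>   // file, even if the live CURRENT has since moved to a new manifest.
> > >>>   try (FSDataOutputStream out = checkpointFs.create(
> > >>>           currentSnapshotPath, FileSystem.WriteMode.NO_OVERWRITE)) {
> > >>>       out.write((pinnedManifest + "\n")
> > >>>               .getBytes(StandardCharsets.UTF_8));
> > >>>   }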
> > >>>
> > >>> Best,
> > >>> Jinzhong
> > >>>
> > >>> On Fri, Mar 22, 2024 at 11:16 PM Zakelly Lan <zakelly....@gmail.com>
> > >>> wrote:
> > >>>
> > >>> > Hi Yue,
> > >>> >
> > >>> > Thanks for bringing this up!
> > >>> >
> > >>> > The CURRENT file is the special one, which should be snapshotted
> > >>> > during the sync phase (temporarily loaded into memory). Thus we can
> > >>> > solve this.
> > >>> >
> > >>> >
> > >>> > Best,
> > >>> > Zakelly
> > >>> >
> > >>> > On Fri, Mar 22, 2024 at 4:55 PM yue ma <mayuefi...@gmail.com> wrote:
> > >>> >
> > >>> > > Hi Jinzhong,
> > >>> > > Thanks for your reply. I still have some doubts about the first
> > >>> > > question. Is there such a case: when you take a snapshot during the
> > >>> > > synchronization phase, you record CURRENT and manifest 8, but
> > >>> > > before the asynchronous phase the manifest reaches the size
> > >>> > > threshold, so CURRENT is switched to the new manifest 9, and then
> > >>> > > the incorrect CURRENT file is uploaded?
> > >>> > >
> > >>> > > Jinzhong Li <lijinzhong2...@gmail.com> wrote on Wed, Mar 20, 2024 at 20:13:
> > >>> > >
> > >>> > > > Hi Yue,
> > >>> > > >
> > >>> > > > Thanks for your feedback!
> > >>> > > >
> > >>> > > > > 1. If we choose Option-3 for ForSt, how would we handle the
> > >>> > > > > Manifest File? Should we take a snapshot of the Manifest
> > >>> > > > > during the synchronization phase?
> > >>> > > >
> > >>> > > > IIUC, the GetLiveFiles() API in Option-3 also captures the
> > >>> > > > fileInfo of the manifest files, and it also returns the manifest
> > >>> > > > file size, which means this API can snapshot the manifest
> > >>> > > > fileInfo (filename + fileSize) during the synchronization phase.
> > >>> > > > You can refer to the RocksDB source code[1] to verify this.
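> > >>> > > >
> > >>> > > > For example, through the RocksJava binding (a sketch; db is an
> > >>> > > > already-open RocksDB instance, and passing flushMemtable=false
> > >>> > > > keeps the sync phase cheap):
> > >>> > > >
> > >>> > > >   // Sync phase: capture the live file list plus the exact
> > >>> > > >   // manifest size at this instant.
> > >>> > > >   RocksDB.LiveFiles live =
> > >>> > > >           db.getLiveFiles(/* flushMemtable */ false);
> > >>> > > >   long manifestSize = live.manifestFileSize;
> > >>> > > >   for (String file : live.files) {
> > >>> > > >       // Names like "/000042.sst", "/MANIFEST-000008", "/CURRENT";
> > >>> > > >       // record (name, size) here and, in the async phase, upload
> > >>> > > >       // the manifest only up to manifestSize.
> > >>> > > >   }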
> > >>> > > >
> > >>> > > >
> > >>> > > > > However, many distributed storage systems do not support Fast
> > >>> > > > > Duplicate (such as HDFS). But ForSt has the ability to directly
> > >>> > > > > read and write remote files. Can we avoid copying or
> > >>> > > > > fast-duplicating these files and instead directly reuse and
> > >>> > > > > reference these remote files? I think this can reduce file
> > >>> > > > > download time and may be more useful for the many users on
> > >>> > > > > HDFS (which does not support Fast Duplicate)?
> > >>> > > >
> > >>> > > > Firstly, as far as I know, most remote file systems support
> > >>> > > > FastDuplicate, e.g. S3 copyObject / Azure Blob Storage copyBlob /
> > >>> > > > OSS copyObject, while HDFS indeed does not support FastDuplicate.
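> > >>> > > >
> > >>> > > > Where FastDuplicate is unavailable, the fallback is a plain byte
> > >>> > > > copy, which is exactly the download cost you mention. A sketch
> > >>> > > > using Flink's generic FileSystem API (names are illustrative):
> > >>> > > >
> > >>> > > >   static void copyFile(Path src, Path dst) throws IOException {
> > >>> > > >       try (FSDataInputStream in = src.getFileSystem().open(src);
> > >>> > > >               FSDataOutputStream out = dst.getFileSystem()
> > >>> > > >                       .create(dst,
> > >>> > > >                               FileSystem.WriteMode.NO_OVERWRITE)) {
> > >>> > > >           // Every byte streams through the client; no
> > >>> > > >           // server-side copy as with S3/OSS copyObject.
> > >>> > > >           IOUtils.copyBytes(in, out, 4096, false);
> > >>> > > >       }
> > >>> > > >   }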
> > >>> > > >
> > >>> > > > Actually, we have considered a design that reuses remote files,
> > >>> > > > and that is what we want to implement in the near future, where
> > >>> > > > both checkpoints and restores can reuse existing files residing
> > >>> > > > on the remote state storage. However, this design conflicts with
> > >>> > > > the current file management system in Flink. At present, remote
> > >>> > > > state files are managed by ForStDB (TaskManager side), while
> > >>> > > > checkpoint files are managed by the JobManager, which is a major
> > >>> > > > hindrance to file reuse. For example, issues could arise if a TM
> > >>> > > > reuses a checkpoint file that is subsequently deleted by the JM.
> > >>> > > > Therefore, as mentioned in FLIP-423[2], our roadmap is to first
> > >>> > > > integrate the checkpoint/restore mechanisms with the existing
> > >>> > > > framework at milestone-1. Then, at milestone-2, we plan to
> > >>> > > > introduce TM State Ownership and Faster Checkpointing mechanisms,
> > >>> > > > which will allow both checkpointing and restoring to directly
> > >>> > > > reuse remote files, thus achieving faster checkpointing and
> > >>> > > > restoring.
> > >>> > > >
> > >>> > > > [1]
> > >>> > > > https://github.com/facebook/rocksdb/blob/6ddfa5f06140c8d0726b561e16dc6894138bcfa0/db/db_filesnapshot.cc#L77
> > >>> > > > [2]
> > >>> > > > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=293046855#FLIP423:DisaggregatedStateStorageandManagement(UmbrellaFLIP)-RoadMap+LaunchingPlan
> > >>> > > >
> > >>> > > > Best,
> > >>> > > > Jinzhong
> > >>> > > >
> > >>> > > > On Wed, Mar 20, 2024 at 4:01 PM yue ma <mayuefi...@gmail.com> wrote:
> > >>> > > >
> > >>> > > > > Hi Jinzhong
> > >>> > > > >
> > >>> > > > > Thank you for initiating this FLIP.
> > >>> > > > >
> > >>> > > > > I just have some minor questions:
> > >>> > > > >
> > >>> > > > > 1. If we choose Option-3 for ForSt, how would we handle the
> > >>> > > > > Manifest File? Should we take a snapshot of the Manifest
> > >>> > > > > during the synchronization phase? Otherwise, might the Manifest
> > >>> > > > > and MetaInfo information be inconsistent during recovery?
> > >>> > > > > 2. For the restore operation, we need to Fast Duplicate
> > >>> > > > > checkpoint files to the working dir. However, many distributed
> > >>> > > > > storage systems do not support Fast Duplicate (such as HDFS).
> > >>> > > > > But ForSt has the ability to directly read and write remote
> > >>> > > > > files. Can we avoid copying or fast-duplicating these files and
> > >>> > > > > instead directly reuse and reference these remote files? I
> > >>> > > > > think this can reduce file download time and may be more useful
> > >>> > > > > for the many users on HDFS (which does not support Fast
> > >>> > > > > Duplicate)?
> > >>> > > > >
> > >>> > > > > --
> > >>> > > > > Best,
> > >>> > > > > Yue
> > >>> > > > >
> > >>> > > >
> > >>> > >
> > >>> > >
> > >>> > > --
> > >>> > > Best,
> > >>> > > Yue
> > >>> > >
> > >>> >
> > >>>
> >
>
>
> --
> Best,
> Hangxiang.
>
