Hi Feifan,

I just replied in the discussion of FLIP-428. I agree that we could leave the 
clean-up optimization to a future FLIP; however, I think we should mention 
this topic explicitly in the current FLIP to make the overall design complete 
and well-rounded.

Best
Yun Tang
________________________________
From: Feifan Wang <zoltar9...@163.com>
Sent: Thursday, March 28, 2024 12:35
To: dev@flink.apache.org <dev@flink.apache.org>
Subject: Re: [DISCUSS] FLIP-427: Disaggregated State Store

Thanks for your reply, Hangxiang. I totally agree with you about the JNI part.

Hi Yun Tang, I just noticed that FLIP-427 mentions “The life cycle of working 
dir is managed as before local strategy.” IIUC, the working dir will be deleted 
after the TaskManager exits, and I think that's enough for the current stage. 
WDYT?
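To make that life cycle concrete, here is a minimal sketch in Java (hypothetical names, not the actual ForSt wiring) of a working dir that is created per TaskManager and removed when the process exits:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch: a TaskManager-scoped working dir whose life cycle ends
// with the process, mirroring "deleted after the TaskManager exits".
public class WorkingDirSketch {

    // Create the working dir and register a shutdown hook so it is removed
    // recursively when the JVM (i.e. the TaskManager) exits normally.
    public static Path createWorkingDir(Path base) throws IOException {
        Path workingDir = Files.createDirectories(base.resolve("forst-working-dir"));
        Runtime.getRuntime().addShutdownHook(new Thread(() -> deleteRecursively(workingDir)));
        return workingDir;
    }

    // Depth-first delete: children before parents.
    static void deleteRecursively(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> p.toFile().delete());
        } catch (IOException ignored) {
            // Best-effort cleanup on shutdown.
        }
    }
}
```

Note this only covers the clean-exit half; a crash would leave the dir behind, which is exactly the clean-up gap worth covering in a future FLIP.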

——————————————

Best regards,

Feifan Wang




At 2024-03-28 12:18:56, "Hangxiang Yu" <master...@gmail.com> wrote:
>Hi, Feifan.
>
>Thanks for your reply.
>
>> What if we only use JNI to access DFS, which needs to reuse the Flink
>> FileSystem, and do all local disk access through the native API? This
>> idea is based on the understanding that the JNI overhead is not worth
>> mentioning compared to DFS access latency. It might make more sense to
>> consider avoiding the JNI overhead for faster local disks. Since local
>> disk as secondary storage is already under consideration [1], maybe we
>> can discuss in that FLIP whether to use the native API to access the
>> local disk?
>>
>This is a good suggestion. It's reasonable to use the native API to access
>the local disk cache, since it requires lower latency than remote access.
>I also believe that the JNI overhead is relatively negligible when weighed
>against the latency of remote I/O, as mentioned in the FLIP.
>So I think we could go with proposal 2 and keep proposal 1 as a potential
>future optimization, which could work better when there is a higher
>performance requirement, or when some filesystems' native libraries offer
>significantly better performance and resource usage than their Java
>counterparts.
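To illustrate the split being agreed on here, a minimal sketch (hypothetical names, not actual ForSt APIs) that dispatches by URI scheme, so that only remote schemes pay the JNI crossing into Flink's FileSystem:

```java
import java.net.URI;

// Hypothetical sketch of the hybrid idea discussed above: access local files
// natively and route only remote (DFS) schemes through the JNI-backed Flink
// FileSystem. All names here are illustrative, not actual ForSt APIs.
public class HybridAccessSketch {
    enum AccessPath { NATIVE_LOCAL, JNI_FLINK_FILESYSTEM }

    static AccessPath routeFor(URI uri) {
        String scheme = uri.getScheme();
        // Local disk (working dir / cache): latency is low, so the JNI
        // overhead would be noticeable; take the native path.
        if (scheme == null || "file".equals(scheme)) {
            return AccessPath.NATIVE_LOCAL;
        }
        // Remote storage (hdfs, s3, oss, ...): JNI overhead is negligible
        // next to remote I/O latency; reuse Flink's FileSystem via JNI.
        return AccessPath.JNI_FLINK_FILESYSTEM;
    }
}
```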
>
>
>On Thu, Mar 28, 2024 at 11:39 AM Feifan Wang <zoltar9...@163.com> wrote:
>
>> Thanks for this valuable proposal Hangxiang !
>>
>>
>> > If we need to introduce a JNI call during each filesystem call, that
>> > would be N times the JNI cost compared with the current RocksDB
>> > state-backend's JNI cost.
>> What if we only use JNI to access DFS, which needs to reuse the Flink
>> FileSystem, and do all local disk access through the native API? This
>> idea is based on the understanding that the JNI overhead is not worth
>> mentioning compared to DFS access latency. It might make more sense to
>> consider avoiding the JNI overhead for faster local disks. Since local
>> disk as secondary storage is already under consideration [1], maybe we
>> can discuss in that FLIP whether to use the native API to access the
>> local disk?
>>
>>
>> >I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
>> >Different disaggregated state storages may have their own semantics about
>> >this configuration, e.g. life cycle, supported file systems or storages.
>> I agree with keeping this configuration as-is and only considering moving
>> it up to the engine level once there are other disaggregated backends.
>>
>>
>> [1] https://cwiki.apache.org/confluence/x/U4p3EQ
>>
>> ——————————————
>>
>> Best regards,
>>
>> Feifan Wang
>>
>>
>>
>>
>> At 2024-03-28 09:55:48, "Hangxiang Yu" <master...@gmail.com> wrote:
>> >Hi, Yun.
>> >Thanks for the reply.
>> >
>> >Your point about the JNI cost is valid. As I replied to Yue, I agreed to
>> >leave room for proposal 1 and consider it as an optimization in the
>> >future, which is also reflected in the updated FLIP.
>> >
>> >> The other question is that the configuration of
>> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt
>> >> state-backend; how would it be if we introduce another disaggregated
>> >> state storage? Thus, I think `state.backend.disaggregated.working-dir`
>> >> might be a better configuration name.
>> >
>> >I'd suggest keeping `state.backend.forSt.working-dir` as it is for now.
>> >Different disaggregated state storages may have their own semantics about
>> >this configuration, e.g. life cycle, supported file systems or storages.
>> >Maybe it's more suitable to consider it together when we introduce other
>> >disaggregated state storages in the future.
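For reference, the two naming options under discussion would look like this in the configuration (the values are illustrative, not from the FLIP):

```yaml
# Current, ForSt-specific key (as proposed in FLIP-427):
state.backend.forSt.working-dir: hdfs://namenode:9000/flink/forst

# Alternative, engine-level key suggested in the thread:
# state.backend.disaggregated.working-dir: hdfs://namenode:9000/flink/working-dir
```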
>> >
>> >On Thu, Mar 28, 2024 at 12:02 AM Yun Tang <myas...@live.com> wrote:
>> >
>> >> Hi Hangxiang,
>> >>
>> >> The design looks good, and I also support leaving space for proposal 1.
>> >>
>> >> As you know, loading index/filter/data blocks for querying across
>> >> levels would introduce high IO access within the LSM tree for old
>> >> data. If we need to introduce a JNI call during each filesystem call,
>> >> that would be N times the JNI cost compared with the current RocksDB
>> >> state-backend's JNI cost.
>> >>
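As a back-of-envelope illustration of this point (all latency numbers below are assumptions for the sketch, not measurements), the fraction of a block read spent in the JNI crossing is tiny for remote I/O but noticeable for a fast local disk:

```java
// Hypothetical back-of-envelope model, not a measurement: the fraction of
// each block read spent on the JNI crossing, for local vs. remote storage.
public class JniOverheadSketch {
    // jniCostMicros: assumed cost of one JNI crossing;
    // ioLatencyMicros: assumed cost of the block read itself.
    static double overheadRatio(double jniCostMicros, double ioLatencyMicros) {
        return jniCostMicros / (jniCostMicros + ioLatencyMicros);
    }

    public static void main(String[] args) {
        double jni = 1.0;            // assume ~1 us per JNI crossing
        double localSsd = 100.0;     // assume ~100 us for a local SSD read
        double remoteDfs = 10_000.0; // assume ~10 ms for a remote DFS read
        // The relative overhead is roughly two orders of magnitude larger
        // for local reads, which is why JNI matters for a local-disk path
        // but is negligible for DFS access.
        System.out.printf("local: %.4f, remote: %.4f%n",
                overheadRatio(jni, localSsd), overheadRatio(jni, remoteDfs));
    }
}
```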
>> >> The other question is that the configuration of
>> >> `state.backend.forSt.working-dir` looks too coupled with the ForSt
>> >> state-backend; how would it be if we introduce another disaggregated
>> >> state storage? Thus, I think `state.backend.disaggregated.working-dir`
>> >> might be a better configuration name.
>> >>
>> >>
>> >> Best
>> >> Yun Tang
>> >>
>> >> ________________________________
>> >> From: Hangxiang Yu <master...@gmail.com>
>> >> Sent: Wednesday, March 20, 2024 11:32
>> >> To: dev@flink.apache.org <dev@flink.apache.org>
>> >> Subject: Re: [DISCUSS] FLIP-427: Disaggregated State Store
>> >>
>> >> Hi, Yue.
>> >> Thanks for the reply.
>> >>
>> >> > If we use proposal 1, we can easily reuse these optimizations. It
>> >> > would even be possible to discuss and review the solution together
>> >> > with the RocksDB community.
>> >>
>> >> We also saw these useful optimizations, which could be applied to
>> >> ForSt in the future.
>> >> But IIUC, they are not tied to proposal 1, right? We could also
>> >> implement the temperature and secondary-cache interfaces to reuse
>> >> them, or build a more complex HybridEnv based on proposal 2.
>> >>
>> >> > My point is whether we should retain the potential of proposal 1 in
>> >> > the design.
>> >> >
>> >> This is a good suggestion. We chose proposal 2 primarily for its
>> >> maintainability and scalability, especially because it can
>> >> conveniently leverage all the filesystems Flink supports.
>> >> Given proposal 1's undeniable performance advantage, I think we could
>> >> also consider it as an optimization in the future.
>> >> For the interface on the DB side, we could also expose more Envs in
>> >> the future.
>> >>
>> >>
>> >> On Tue, Mar 19, 2024 at 9:14 PM yue ma <mayuefi...@gmail.com> wrote:
>> >>
>> >> > Hi Hangxiang,
>> >> >
>> >> > Thanks for bringing this discussion.
>> >> > I have a few questions about the Proposal you mentioned in the FLIP.
>> >> >
>> >> > The current conclusion is to use proposal 2, which is okay for me.
>> >> > My point is whether we should retain the potential of proposal 1 in
>> >> > the design, for the following reasons:
>> >> > 1. No JNI overhead, as the Performance part of the FLIP mentions.
>> >> > 2. RocksDB also provides an Env interface, and there are already
>> >> > some implementations, such as an HDFS Env, which seem easily
>> >> > extensible.
>> >> > 3. The RocksDB community continues to support the LSM tree on
>> >> > different storage media, such as Tiered Storage
>> >> > <https://github.com/facebook/rocksdb/wiki/Tiered-Storage-%28Experimental%29>,
>> >> > and some optimizations have been made for this scenario, such as
>> >> > Per-Key Placement Compaction
>> >> > <https://rocksdb.org/blog/2022/11/09/time-aware-tiered-storage.html>
>> >> > and the Secondary Cache
>> >> > <https://github.com/facebook/rocksdb/wiki/SecondaryCache-%28Experimental%29>,
>> >> > which is similar to the Hybrid Block Cache mentioned in FLIP-423.
>> >> > If we use proposal 1, we can easily reuse these optimizations. It
>> >> > would even be possible to discuss and review the solution together
>> >> > with the RocksDB community.
>> >> > In fact, we have already implemented some production practices using
>> >> > proposal 1 internally. We have integrated HybridEnv, Tiered Storage,
>> >> > and the Secondary Cache on RocksDB and optimized the performance of
>> >> > checkpointing and state restore. It seems to work well for us.
>> >> >
>> >> > --
>> >> > Best,
>> >> > Yue
>> >> >
>> >>
>> >>
>> >> --
>> >> Best,
>> >> Hangxiang.
>> >>
>> >
>> >
>> >--
>> >Best,
>> >Hangxiang.
>>
>
>
>--
>Best,
>Hangxiang.
