Hi, Weihua, Thanks for the questions and the ideas. > 1. How many performance regressions would there be if we only used remote storage?
The new architecture can support to use remote storage only, but this FLIP target is to improve job stability. And the change in the FLIP has been significantly complex and the goal of the first version is to update Hybrid Shuffle to the new architecture and support remote storage as a supplement. The performance of this version is not the first priority, so we haven’t tested the performance of using only remote storage. If there are indeed regressions, we will keep optimizing the performance of the remote storages and improve it until only remote storage is available in the production environment. > 2. Shall we move the local data to remote storage if the producer is finished for a long time? I agree that it is a good idea, which can release task manager resources more timely. But moving data from TM local disk to remote storage needs more detailed discussion and design, and it is easier to implement it based on the new architecture. Considering the complexity, the target focus, and the iteration cycle of the FLIP, we decide that the details are not included in the first version. We will extend and implement them in the subsequent versions. Best, Yuxin Weihua Hu <huweihua....@gmail.com> 于2023年3月9日周四 11:22写道: > Hi, Yuxin > > Thanks for driving this FLIP. > > The remote storage shuffle could improve the stability of Batch jobs. > > In our internal scenario, we use a hybrid cluster to run both > Streaming(high priority) > and Batch jobs(low priority). When there is not enough resources(such as > cpu usage > reaches a threshold), the batch containers will be evicted. So this will > cause some re-run > of batch tasks. > > It would be a great help if the remote storage could address this. So I > have a few questions. > > 1. How many performance regressions would there be if we only used remote > storage? > > 2. In current design, the shuffle data segment will write to one kind of > storage tier. > Shall we move the local data to remote storage if the producer is finished > for a long time? > So we can release the idle task manager with no shuffle data on it. This > may help to reduce > the resource usage when producer parallelism is larger than consume. > > Best, > Weihua > > > On Thu, Mar 9, 2023 at 10:38 AM Yuxin Tan <tanyuxinw...@gmail.com> wrote: > > > Hi, Junrui, > > Thanks for the suggestions and ideas. > > > > > If they are fixed, I suggest that FLIP could provide clearer > > explanations. > > I have updated the FLIP and described the segment size more clearly. > > > > > can we provide configuration options for users to manually adjust the > > sizes? > > The segment size can be configured if necessary. But considering that if > we > > exposed these parameters prematurely, it may be difficult to modify the > > implementation later because the user has already used the configs. We > > can make these internal configs or fixed values when implementing the > first > > version, I think we can use either of these two ways, because they are > > internal and do not affect the public APIs. > > > > Best, > > Yuxin > > > > > > Junrui Lee <jrlee....@gmail.com> 于2023年3月8日周三 00:24写道: > > > > > Hi Yuxin, > > > > > > This FLIP looks quite reasonable. Flink can solve the problem of Batch > > > shuffle by > > > combining local and remote storage, and can use fixed local disks for > > > better performance > > > in most scenarios, while using remote storage as a supplement when > local > > > disks are not > > > sufficient, avoiding wasteful costs and poor job stability. Moreover, > > the > > > solution also > > > considers the issue of dynamic switching, which can automatically > switch > > to > > > remote > > > storage when the local disk is full, saving costs, and automatically > > switch > > > back when > > > there is available space on the local disk. > > > > > > As Wencong Liu stated, an appropriate segment size is essential, as it > > can > > > significantly > > > affect shuffle performance. I also agree that the first version should > > > focus mainly on the > > > design and implementation. However, I have a small question about > FLIP. I > > > did not see > > > any information regarding the segment size of memory, local disk, and > > > remote storage > > > in this FLIP. Are these three values fixed at present? If they are > > fixed, I > > > suggest that FLIP > > > could provide clearer explanations. Moreover, although a dynamic > segment > > > size > > > mechanism is not necessary at the moment, can we provide configuration > > > options for users > > > to manually adjust these sizes? I think it might be useful. > > > > > > Best, > > > Junrui. > > > > > > Yuxin Tan <tanyuxinw...@gmail.com> 于2023年3月7日周二 20:14写道: > > > > > > > Thanks for joining the discussion. > > > > > > > > @weijie guo > > > > > 1. How to optimize the broadcast result partition? > > > > For the partitions with multi-consumers, e.g., broadcast result > > > partition, > > > > partition reuse, > > > > speculative, etc, the processing logic is the same as the original > > Hybrid > > > > Shuffle, that is, > > > > using the full spilling strategy. It indeed may reduce the > opportunity > > to > > > > consume from > > > > memory, but the PoC shows that it has no effect on the performance > > > > basically. > > > > > > > > > 2. Can the new proposal completely avoid this problem of inaccurate > > > > backlog > > > > calculation? > > > > Yes, this can avoid the problem completely. About the read buffers, > > the N > > > > is to reserve > > > > one exclusive buffer per channel, which is to avoid the deadlock > > because > > > > the buffers > > > > are acquired by some channels and other channels can not request any > > > > buffers. But > > > > the buffers except for the N can be floating (competing to request > the > > > > buffers) by all > > > > channels. > > > > > > > > @Wencong Liu > > > > > Deciding the Segment size dynamically will be helpful. > > > > I agree that it may be better if the segment size is dynamically > > decided, > > > > but for simplifying > > > > the implementation of the first version, we want to make this a fixed > > > value > > > > for each tier. > > > > In the future, this can be a good improvement if necessary. In the > > first > > > > version, we will mainly > > > > focus on the more important features, such as the tiered storage > > > > architecture, dynamic > > > > switching tiers, supporting remote storage, memory management, etc. > > > > > > > > Best, > > > > Yuxin > > > > > > > > > > > > Wencong Liu <liuwencle...@163.com> 于2023年3月7日周二 16:48写道: > > > > > > > > > Hello Yuxin, > > > > > > > > > > > > > > > Thanks for your proposal! Adding remote storage capability to > > > Flink's > > > > > Hybrid Shuffle is a significant improvement that addresses the > issue > > of > > > > > local disk storage limitations. This enhancement not only ensures > > > > > uninterrupted Shuffle, but also enables Flink to handle larger > > > workloads > > > > > and more complex data processing tasks. With the ability to > > seamlessly > > > > > shift between local and remote storage, Flink's Hybrid Shuffle will > > be > > > > more > > > > > versatile and scalable, making it an ideal choice for organizations > > > > looking > > > > > to build distributed data processing applications with ease. > > > > > Besides, I've a small question about the size of Segment in > > > different > > > > > storages. According to the FLIP, the size of Segment may be fixed > for > > > > each > > > > > Storage Tier, but I think the fixed size may affect the shuffle > > > > > performance. For example, smaller segment size will improve the > > > > utilization > > > > > rate of Memory Storage Tier, but it may brings extra cost to Disk > > > Storage > > > > > Tier or Remote Storage Tier. Deciding the size of Segment dynamicly > > > will > > > > be > > > > > helpful. > > > > > > > > > > Best, > > > > > > > > > > > > > > > Wencong Liu > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > At 2023-03-06 13:51:21, "Yuxin Tan" <tanyuxinw...@gmail.com> > wrote: > > > > > >Hi everyone, > > > > > > > > > > > >I would like to start a discussion on FLIP-301: Hybrid Shuffle > > > supports > > > > > >Remote Storage[1]. > > > > > > > > > > > >In the cloud-native environment, it is difficult to determine the > > > > > >appropriate > > > > > >disk space for Batch shuffle, which will affect job stability. > > > > > > > > > > > >This FLIP is to support Remote Storage for Hybrid Shuffle to > improve > > > the > > > > > >Batch job stability in the cloud-native environment. > > > > > > > > > > > >The goals of this FLIP are as follows. > > > > > >1. By default, use the local memory and disk to ensure high > shuffle > > > > > >performance if the local storage space is sufficient. > > > > > >2. When the local storage space is insufficient, use remote > storage > > as > > > > > >a supplement to avoid large-scale Batch job failure. > > > > > > > > > > > >Looking forward to hearing from you. > > > > > > > > > > > >[1] > > > > > > > > > > > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-301%3A+Hybrid+Shuffle+supports+Remote+Storage > > > > > > > > > > > >Best, > > > > > >Yuxin > > > > > > > > > > > > > > >