Hi Nicholas,

Thanks for the valuable feedback.

> 1. Could you describe in detail what functions the relevant components mentioned in Proposed Changes

These components are simply the pluggable implementations of the Celeborn tier. The details of the tiers and the mechanism for switching between them are described in the previous FLIP[1]. Celeborn is added to hybrid shuffle as a new tier, similar to the existing tiers such as the Memory tier and the Disk tier. In this tiered storage, agents serve as the entry points for interaction between the framework and each tier. For instance, CelebornProducerAgent is the entry point through which producers emit data into the Celeborn tier. If you still have questions after reading that FLIP, please feel free to let me know.

> 2. Can you briefly introduce how to guarantee compatibility with Celeborn’s existing features such as partition splitting?

This integration is a new, separate way for Celeborn to work with Flink, so it does not affect the compatibility of the existing shuffle service mode. The new integration will also support the features of the existing mode. For example, partition splitting will be supported by opening the stream for the next partition once the previous partition has been read completely. Since these features are implementation details, I initially left them out of the CIP to keep it focused, simple, and easy to understand. Following your question, I have added some of these details to the CIP.

> 3. Is there any public configuration of integration with Hybrid Shuffle and Flink client?

Yes, a new Flink configuration option is added, and it is described in FLIP-459[2].

> 4. How does the server side guarantee the accuracy and recoverability of Segment information?

Like the other write-related information, the segment information is stored in FileInfo, where the lock protects it to guarantee accuracy. Recoverability is achieved through serialization and deserialization, in the same way as for the other fields.

> 5. Should Celeborn wait until FLIP-459 is released before releasing this integration? Which Flink version will release FLIP-459?

Yes, the Celeborn-side integration should wait until FLIP-459 is released, because the feature relies on both CIP-6 and FLIP-459 to function correctly. If all goes well, FLIP-459 could be part of Flink's next release, Flink 1.20.

Hi Keyong,

Thanks for the reminder and for your interest in the Reduce Partition integration. Once the Map Partition part is finished, we will continue with it as soon as possible.

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-301%3A+Hybrid+Shuffle+supports+Remote+Storage
[2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn

Best,
Yuxin

On Sat, Jun 8, 2024 at 13:00, Keyong Zhou <[email protected]> wrote:

> Hi Yuxin and Xintong,
>
> Really excited to see Flink and Celeborn communities collaborate
> more on shuffle component! I believe this will inspire more for both sides
> :)
>
> +1 for this proposal, looking forward to see this feature to make progress.
>
> Also I'm very interested in integrating Flink Hybrid Shuffle with Celeborn's
> Reduce Partition as mentioned in the doc in the future, which I believe will
> benefit more for very large shuffle operators :)
>
> Regards,
> Keyong Zhou
>
> On Thu, Jun 6, 2024 at 13:25, Nicholas Jiang <[email protected]> wrote:
>
> > Hi Yuxin,
> >
> > Thanks for driving this CIP about integration with Hybrid Shuffle. I have
> > some comments on this CIP:
> >
> > 1. Could you describe in detail what functions the relevant components
> > mentioned in Proposed Changes, including CelebornProducerAgent,
> > CelebornConsumerAgent, CelebornMasterAgent, etc., support? In the design
> > document, these components are only mentioned and no any details of changes.
> >
> > 2. Can you briefly introduce how to guarantee compatibility with
> > Celeborn’s existing features such as partition splitting? IMO, the
> > compatibility introduction should be mentioned in Proposed Changes to help
> > community developers understand.
> >
> > 3. There are no changes on public interfaces. Is there any public
> > configuration of integration with Hybrid Shuffle and Flink client?
> >
> > 4. The server side must store Segment information for each subpartition.
> > How does the server side guarantee the accuracy and recoverability of
> > Segment information?
> >
> > 5. Should Celeborn wait until FLIP-459 is released before releasing this
> > integration? Which Flink version will release FLIP-459?
> >
> > Regards,
> > Nicholas Jiang
> >
> > On 2024/05/28 12:51:32 Yuxin Tan wrote:
> > > Hi all,
> > >
> > > I would like to start a discussion on CIP-6 Support Flink hybrid shuffle
> > > integration with Apache Celeborn[1]. Celeborn provides a stable,
> > > performant, scalable remote shuffle service.
> > > Concurrently, Flink hybrid shuffle supports transitions between memory,
> > > disk, and remote storage to improve performance and job stability. This
> > > integration proposal is to harness the benefits from both Celeborn and
> > > hybrid shuffle simultaneously.
> > >
> > > Note that this proposal has two parts.
> > > 1. The Celeborn-side changes are in CIP-6[1].
> > > 2. The Flink-side modifications are in FLIP-459[2].
> > >
> > > Looking forward to everyone's feedback and suggestions. Thank you!
> > >
> > > [1] https://cwiki.apache.org/confluence/display/CELEBORN/CIP-6+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> > > [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-459%3A+Support+Flink+hybrid+shuffle+integration+with+Apache+Celeborn
> > >
> > > Best,
> > > Yuxin
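To make the partition-split handling from the answer to question 2 more concrete, here is a minimal Java sketch. It assumes a hypothetical `PartitionLocation` type and a caller-supplied function that opens a stream for one split; neither is Celeborn's or Flink's actual API. The reader fully consumes the stream of one split before opening the stream of the next one, which is the behavior described in that answer.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

// Hypothetical sketch, not Celeborn's or Flink's actual API: reads one logical
// partition that the server has split into several physical partitions.
final class SplitAwareReader implements Iterator<byte[]> {

    // Hypothetical stand-in for the location of one split partition.
    record PartitionLocation(String id) {}

    private final Deque<PartitionLocation> pending;
    private final Function<PartitionLocation, Iterator<byte[]>> openStream;
    private Iterator<byte[]> current; // stream of the split being read; null before the first open

    SplitAwareReader(List<PartitionLocation> splits,
                     Function<PartitionLocation, Iterator<byte[]>> openStream) {
        this.pending = new ArrayDeque<>(splits);
        this.openStream = openStream;
    }

    @Override
    public boolean hasNext() {
        // Open the next split's stream only once the current one is read completely.
        while (current == null || !current.hasNext()) {
            PartitionLocation next = pending.poll();
            if (next == null) {
                return false; // every split has been consumed
            }
            current = openStream.apply(next);
        }
        return true;
    }

    @Override
    public byte[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return current.next();
    }
}
```

Keeping the splits in a queue lets the consumer treat several physical splits as one logical stream; an empty split is skipped because `hasNext()` keeps advancing until it finds data or runs out of splits.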
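Similarly, for the answer to question 4, here is a minimal sketch of segment metadata kept alongside other file metadata, guarded by a lock for accuracy and made recoverable through serialization. The class name `SegmentInfo`, the per-subpartition "first buffer index of each segment" layout, and the use of a `ReentrantReadWriteLock` are all assumptions for illustration, not Celeborn's actual `FileInfo` implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch, not Celeborn's actual FileInfo: per-subpartition segment
// metadata guarded by a lock (accuracy) and serializable (recoverability).
final class SegmentInfo {
    // Hypothetical layout: subpartition id -> first buffer index of each segment.
    private final Map<Integer, List<Integer>> segmentStarts = new HashMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // Writers take the write lock so concurrent updates cannot corrupt the map.
    void addSegment(int subpartitionId, int firstBufferIndex) {
        lock.writeLock().lock();
        try {
            segmentStarts.computeIfAbsent(subpartitionId, k -> new ArrayList<>())
                    .add(firstBufferIndex);
        } finally {
            lock.writeLock().unlock();
        }
    }

    // Readers receive a copy taken under the read lock.
    List<Integer> segments(int subpartitionId) {
        lock.readLock().lock();
        try {
            return new ArrayList<>(segmentStarts.getOrDefault(subpartitionId, List.of()));
        } finally {
            lock.readLock().unlock();
        }
    }

    // Serializes a consistent snapshot, e.g. to persist it with the other metadata.
    byte[] serialize() {
        lock.readLock().lock();
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(segmentStarts.size());
            for (Map.Entry<Integer, List<Integer>> entry : segmentStarts.entrySet()) {
                out.writeInt(entry.getKey());
                out.writeInt(entry.getValue().size());
                for (int start : entry.getValue()) {
                    out.writeInt(start);
                }
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // not expected for an in-memory stream
        } finally {
            lock.readLock().unlock();
        }
    }

    // Rebuilds the metadata from serialize() output when it has to be recovered.
    static SegmentInfo deserialize(byte[] data) {
        SegmentInfo info = new SegmentInfo();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int subpartitions = in.readInt();
            for (int i = 0; i < subpartitions; i++) {
                int subpartitionId = in.readInt();
                int count = in.readInt();
                for (int j = 0; j < count; j++) {
                    info.addSegment(subpartitionId, in.readInt());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return info;
    }
}
```

A read-write lock is one reasonable choice here because segment lookups by consumers are likely more frequent than segment additions; a plain lock would give the same correctness.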
