Hi Di, Thanks for your proposal. +1 for the contribution. I'd like to know your thoughts about the following questions:
1. According to your clarification of the exactly-once, thanks for it BTW, no PreCommitTopology is required. Does it make sense to let DorisSink[1] implement SupportsCommitter, since the TwoPhaseCommittingSink is deprecated[2] before turning the Doris connector into a Flink connector? 2. OLAP engines are commonly used as the tail/downstream of a data pipeline to support further e.g. ad-hoc query or cube with feasible pre-aggregation. Just out of curiosity, would you like to share some real use cases that will use OLAP engines as the source of a streaming data pipeline? Or it will only be used as the source for the batch? 3. The E2E test only covered sink[3], if I am not mistaken. Would you like to test the source in E2E too? [1] https://github.com/apache/doris-flink-connector/blob/43e0e5cf9b832854ea228fb093077872e3a311b6/flink-doris-connector/src/main/java/org/apache/doris/flink/sink/DorisSink.java#L55 [2] https://cwiki.apache.org/confluence/display/FLINK/FLIP-372%3A+Enhance+and+synchronize+Sink+API+to+match+the+Source+API [3] https://github.com/apache/doris-flink-connector/blob/43e0e5cf9b832854ea228fb093077872e3a311b6/flink-doris-connector/src/test/java/org/apache/doris/flink/tools/cdc/MySQLDorisE2ECase.java#L96 Best regards, Jing On Tue, Mar 5, 2024 at 11:18 AM wudi <676366...@qq.com.invalid> wrote: > Hi, Jeyhun Karimov. > Thanks for your question. > > - How to ensure Exactly-Once? > 1. When the Checkpoint Barrier arrives, DorisSink will trigger the > precommit api of StreamLoad to complete the persistence of data in Doris > (the data will not be visible at this time), and will also pass this TxnID > to the Committer. > 2. When this Checkpoint of the entire Job is completed, the Committer will > call the commit api of StreamLoad and commit TxnID to complete the > visibility of the transaction. > 3. When the task is restarted, the Txn with successful precommit and > failed commit will be aborted based on the label-prefix, and Doris' abort > API will be called. (At the same time, Doris will also abort transactions > that have not been committed for a long time) > > ps: At the same time, this part of the content has been updated in FLIP > > - Because the default table model in Doris is Duplicate ( > https://doris.apache.org/docs/data-table/data-model/), which does not > have a primary key, batch writing may cause data duplication, but UNIQ The > model has a primary key, which ensures the idempotence of writing, thus > achieving Exactly-Once > > Brs, > di.wu > > > > 2024年3月2日 17:50,Jeyhun Karimov <je.kari...@gmail.com> 写道: > > > > Hi, > > > > Thanks for the proposal. +1 for the FLIP. > > I have a few questions: > > > > - How exactly the two (Stream Load's two-phase commit and Flink's > two-phase > > commit) combination will ensure the e2e exactly-once semantics? > > > > - The FLIP proposes to combine Doris's batch writing with the primary key > > table to achieve Exactly-Once semantics. Could you elaborate more on > that? > > Why it is not the default behavior but a workaround? > > > > Regards, > > Jeyhun > > > > On Sat, Mar 2, 2024 at 10:14 AM Yanquan Lv <decq12y...@gmail.com> wrote: > > > >> Thanks for driving this. > >> The content is very detailed, it is recommended to add a section on Test > >> Plan for more completeness. > >> > >> Di Wu <d...@apache.org> 于2024年1月25日周四 15:40写道: > >> > >>> Hi all, > >>> > >>> Previously, we had some discussions about contributing Flink Doris > >>> Connector to the Flink community [1]. I want to further promote this > >> work. > >>> I hope everyone will help participate in this FLIP discussion and > provide > >>> more valuable opinions and suggestions. > >>> Thanks. > >>> > >>> [1] https://lists.apache.org/thread/lvh8g9o6qj8bt3oh60q81z0o1cv3nn8p > >>> > >>> Brs, > >>> di.wu > >>> > >>> > >>> > >>> On 2023/12/07 05:02:46 wudi wrote: > >>>> > >>>> Hi all, > >>>> > >>>> As discussed in the previous email [1], about contributing the Flink > >>> Doris Connector to the Flink community. > >>>> > >>>> > >>>> Apache Doris[2] is a high-performance, real-time analytical database > >>> based on MPP architecture, for scenarios where Flink is used for data > >>> analysis, processing, or real-time writing on Doris, Flink Doris > >> Connector > >>> is an effective tool. > >>>> > >>>> At the same time, Contributing Flink Doris Connector to the Flink > >>> community will further expand the Flink Connectors ecosystem. > >>>> > >>>> So I would like to start an official discussion FLIP-399: Flink > >>> Connector Doris[3]. > >>>> > >>>> Looking forward to comments, feedbacks and suggestions from the > >>> community on the proposal. > >>>> > >>>> [1] https://lists.apache.org/thread/lvh8g9o6qj8bt3oh60q81z0o1cv3nn8p > >>>> [2] > >> https://doris.apache.org/docs/dev/get-starting/what-is-apache-doris/ > >>>> [3] > >>> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-399%3A+Flink+Connector+Doris > >>>> > >>>> > >>>> Brs, > >>>> > >>>> di.wu > >>>> > >>> > >> > >