To @Yanquan Lv: The concept of 'Region' comes from Blink and older Flink versions. In current Flink, 'Region' refers to the `Restart Pipelined Region Failover Strategy`.
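For readers following along, that strategy is chosen via Flink's failover-strategy configuration. A minimal, illustrative `flink-conf.yaml` fragment (the `region` value is the documented default in recent Flink releases):

```yaml
# Restart only the pipelined region containing the failed task,
# instead of restarting the whole job ("full").
jobmanager.execution.failover-strategy: region
```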
On Wed, Feb 12, 2025 at 9:35 AM Yanquan Lv <decq12y...@gmail.com> wrote:
>
> Hi Fang Yong. Thanks for driving it.
> Adding an alternative way to complete the global commit is good for me.
>
> I still have a confusion similar to xiangyu feng's:
> > This committer node will connect all the tasks in Flink job together, and
> > all the tasks are within one region. As a result, when any task fails, it
> > will trigger a global failover of the Flink job.
> Why would the failure of one subtask have an impact on other subtasks? The
> explanation in the Flink documentation of how a region is formed is not
> particularly detailed. Can you explain more about it?
>
> > On Feb 11, 2025, at 16:55, Yong Fang <zjur...@gmail.com> wrote:
> >
> > Thanks for all the feedback.
> >
> > To @xiangyu feng:
> > 1. Regarding region failover, I'm referring to the `Restart Pipelined
> > Region Failover Strategy` [1] in Flink.
> > 2. This feature mainly enables support for region failover, enhancing
> > stability. Of course, in practical use, users can configure the JM to
> > provide more suitable resources for the committer node.
> >
> > To @Xintong Song:
> > It doesn't matter if the notifyComplete of one checkpoint fails to be
> > called. Since the data will be stored in the HDFS path of the coordinator,
> > when the notifyComplete of the next CP is called, the data files generated
> > by the two CPs will be merged to create a Paimon snapshot.
> >
> > To @Yunfeng Zhou:
> > The pressure on the JM (JobManager) depends on two aspects:
> > Message communication pressure: it is determined by the checkpoint (CP)
> > interval and the number of sinks. Each sink generates one message in each
> > CP. If there are a total of 5000 sinks, the JM will receive 5000 messages
> > in one CP, which may take a few seconds.
> > Pressure from executing table commits: it is necessary to create a Paimon
> > snapshot, and compaction may even be triggered.
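The recovery behavior described above for @Xintong Song's question can be sketched in a few lines. This is a hypothetical simulation, not the Paimon or Flink API (all names are invented): data files of a checkpoint whose notifyComplete was missed stay pending under the coordinator's storage path, and are merged into the snapshot committed by the next successful checkpoint.

```python
class CommitCoordinatorSketch:
    """Hypothetical sketch; names are illustrative, not the Paimon API."""

    def __init__(self):
        self.pending = []    # (cp_id, data_files) persisted under the coordinator's HDFS path
        self.snapshots = []  # committed Paimon snapshots (simulated)

    def on_checkpoint(self, cp_id, data_files):
        # Files are durably stored before any commit is attempted.
        self.pending.append((cp_id, data_files))

    def notify_complete(self, cp_id, delivered=True):
        # If the notification is lost, the files simply stay pending.
        if not delivered:
            return
        # Merge all pending files up to this checkpoint into one snapshot.
        files = [f for c, fs in self.pending if c <= cp_id for f in fs]
        self.pending = [(c, fs) for c, fs in self.pending if c > cp_id]
        self.snapshots.append((cp_id, files))
```

In this sketch, a lost notifyComplete for CP 1 is repaired when CP 2 completes: the snapshot created for CP 2 contains the data files of both checkpoints.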
> > Therefore, asynchronous threads should be considered for these operations
> > to avoid blocking the main thread of the JM.
> >
> > To @Jingsong and @wj wang:
> > When using Flink SinkV2 to write to Paimon, a global committer node will
> > also be generated in the physical execution plan to create and manage
> > Paimon snapshots. Therefore, the same problem arises.
> >
> > [1]
> > https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-pipelined-region-failover-strategy
> >
> > Best,
> > Fang Yong
> >
> > On Tue, Feb 11, 2025 at 11:26 AM wj wang <hongli....@gmail.com> wrote:
> >
> >> Hi Yong, thanks for driving this PIP.
> >> I have a small question:
> >> Why not use Flink SinkV2 instead of moving the committer logic into
> >> JM's OperatorCoordinator?
> >>
> >> On Tue, Feb 11, 2025 at 10:34 AM Jingsong Li <jingsongl...@gmail.com>
> >> wrote:
> >>>
> >>> Thanks Yong for driving this PIP!
> >>>
> >>>> Currently, there will be a global committer node in a Flink Paimon job
> >>>> which is used to ensure the consistency of written data in Paimon. This
> >>>> committer node will connect all the tasks in the Flink job together, and
> >>>> all the tasks are within one region. As a result, when any task fails, it
> >>>> will trigger a global failover of the Flink job. We use HDFS as the remote
> >>>> storage, and we often encounter situations where a global failover of the
> >>>> job is triggered by write timeouts or errors when writing to HDFS, which
> >>>> causes quite a few stability issues.
> >>>
> >>> I know that Flink SinkV2 is also committed through a regular node;
> >>> does this mean that SinkV2 also has this drawback?
> >>>
> >>> Best,
> >>> Jingsong
> >>>
> >>> On Mon, Feb 10, 2025 at 4:06 PM Yunfeng Zhou
> >>> <flink.zhouyunf...@gmail.com> wrote:
> >>>>
> >>>> Hi Yong,
> >>>>
> >>>> The general idea looks good to me.
> >>>> Are there any statistics on the
> >>>> number of operator events that need to be transmitted between the
> >>>> coordinator and the writer operators? This information could help provide
> >>>> estimates of the additional workload on the JM, preventing the JM from
> >>>> becoming a single bottleneck for the throughput of Paimon sinks.
> >>>>
> >>>> Best,
> >>>> Yunfeng
> >>>>
> >>>>> On Jan 23, 2025, at 17:44, Yong Fang <zjur...@gmail.com> wrote:
> >>>>>
> >>>>> Hi devs,
> >>>>>
> >>>>> I would like to start a discussion about PIP-30: Improvement For Paimon
> >>>>> Committer In Flink [1].
> >>>>>
> >>>>> Currently, Flink writes data to Paimon based on Two-Phase Commit, which
> >>>>> generates a global committer node and connects all tasks into one
> >>>>> region. If any task fails, it will lead to a global failover of the
> >>>>> Flink job.
> >>>>>
> >>>>> To solve this issue, we would like to introduce a Paimon Writer
> >>>>> Coordinator to perform the table commit operation, enabling Flink
> >>>>> Paimon jobs to support region failover and improving stability.
> >>>>>
> >>>>> Looking forward to hearing from you, thanks!
> >>>>>
> >>>>> [1]
> >>>>> https://cwiki.apache.org/confluence/display/PAIMON/PIP-30%3A+Improvement+For+Paimon+Committer+In+Flink
> >>>>>
> >>>>> Best,
> >>>>> Fang Yong
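A recurring point in this thread is that the coordinator's table commit (snapshot creation, possibly compaction) must not block the JM main thread. A hypothetical sketch of that idea, with invented names and not the actual coordinator code: the main thread only enqueues the commit onto a single background worker and returns immediately.

```python
from concurrent.futures import ThreadPoolExecutor

class AsyncCommitSketch:
    """Hypothetical: offload slow table commits so the JM main thread never blocks."""

    def __init__(self, commit_fn):
        self.commit_fn = commit_fn  # e.g. a function that creates a Paimon snapshot
        # A single worker keeps commits in checkpoint order.
        self.executor = ThreadPoolExecutor(max_workers=1)

    def notify_checkpoint_complete(self, cp_id, data_files):
        # Called on the JM main thread: just enqueue and return a future.
        return self.executor.submit(self.commit_fn, cp_id, data_files)

    def close(self):
        self.executor.shutdown(wait=True)
```

Returning a future per commit also lets commit failures be observed and handled later without ever stalling the main thread.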