Thanks for the clarification, Yong. Sounds good to me.

Best,

Xintong



On Wed, Feb 12, 2025 at 10:15 AM wj wang <hongli....@gmail.com> wrote:

> To @Yanquan Lv:
> The concept of 'Region' existed in Blink and older Flink versions. Now
> 'Region' means the `Restart Pipelined Region Failover Strategy` in
> Flink.
>
> On Wed, Feb 12, 2025 at 9:35 AM Yanquan Lv <decq12y...@gmail.com> wrote:
> >
> > Hi Fang Yong. Thanks for driving it.
> > Adding an alternative way to complete the global commit sounds good to me.
> >
> > I still have a question similar to xiangyu feng's:
> > > This committer node will connect all the tasks in the Flink job
> > > together, and all the tasks are within one region. As a result, when
> > > any task fails, it will trigger a global failover of the Flink job.
> > Why would the failure of one subtask affect other subtasks? The
> > explanation of how regions are formed in the Flink documentation is not
> > particularly detailed. Could you explain it in more detail?
> >
> >
> > > On Feb 11, 2025, at 16:55, Yong Fang <zjur...@gmail.com> wrote:
> > >
> > > Thanks for all the feedback.
> > >
> > > To @xiangyu feng:
> > > 1. Regarding region failover, I'm referring to the `Restart Pipelined
> > > Region Failover Strategy` [1] in Flink.
> > > 2. This feature mainly enables support for region failover, enhancing
> > > stability. Of course, in practical use, users can configure the JM to
> > > provide more suitable resources for the committer node.
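> > >
> > > As a purely illustrative sketch (the key is a standard Flink option,
> > > but the size is a made-up placeholder that depends on the job), the JM
> > > could be given more memory in flink-conf.yaml to absorb the commit
> > > work:
> > >
> > >   jobmanager.memory.process.size: 4096m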
> > >
> > > To @Xintong Song:
> > > It doesn't matter if notifyComplete fails to be called for one
> > > checkpoint. Since the data files are stored under the coordinator's
> > > HDFS path, when notifyComplete of the next CP is called, the data
> > > files generated by both CPs will be merged to create a single Paimon
> > > snapshot.
> > >
> > > To @Yunfeng Zhou:
> > > The pressure on the JM (JobManager) depends on two aspects:
> > > 1. Message communication pressure: determined by the checkpoint (CP)
> > > interval and the number of sinks. Each sink generates one message per
> > > CP, so with a total of 5000 sinks the JM will receive 5000 messages in
> > > one CP, which may take a few seconds to process.
> > > 2. Pressure from executing table commits: a Paimon snapshot must be
> > > created, and compaction may even be triggered. Therefore, these
> > > operations should run on asynchronous threads to avoid blocking the
> > > main thread of the JM; see the sketch below.
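> > >
> > > A minimal sketch of that second point (again with hypothetical names,
> > > not the actual PIP-30 classes): the snapshot creation and any
> > > compaction run on a dedicated executor, so the JM's main thread only
> > > schedules the work:
> > >
> > >   import java.util.List;
> > >   import java.util.concurrent.CompletableFuture;
> > >   import java.util.concurrent.ExecutorService;
> > >   import java.util.concurrent.Executors;
> > >
> > >   class AsyncCommitSketch {
> > >
> > >       private final ExecutorService commitExecutor =
> > >               Executors.newSingleThreadExecutor(r -> {
> > >                   Thread t = new Thread(r, "paimon-commit");
> > >                   t.setDaemon(true);
> > >                   return t;
> > >               });
> > >
> > >       CompletableFuture<Void> commitAsync(List<String> files) {
> > >           // Runs off the JM main thread; a failure surfaces through
> > >           // the returned future instead of blocking checkpoint RPCs.
> > >           return CompletableFuture.runAsync(
> > >                   () -> createSnapshotAndMaybeCompact(files), commitExecutor);
> > >       }
> > >
> > >       private void createSnapshotAndMaybeCompact(List<String> files) {
> > >           // create the Paimon snapshot; trigger compaction if needed
> > >       }
> > >   }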
> > >
> > > To @Jingsong and @wj wang:
> > > When using Flink SinkV2 to write to Paimon, a global committer node
> > > will also be generated in the physical execution plan to create and
> > > manage Paimon snapshots. Therefore, the same problem arises.
> > >
> > > [1]
> > > https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/task_failure_recovery/#restart-pipelined-region-failover-strategy
> > >
> > > Best,
> > > Fang Yong
> > >
> > > On Tue, Feb 11, 2025 at 11:26 AM wj wang <hongli....@gmail.com> wrote:
> > >
> > >> Hi Yong, thanks for driving this PIP.
> > >> I have a small question:
> > >> Why not use Flink SinkV2 instead of moving the committer logic into
> > >> JM's OperatorCoordinator?
> > >>
> > >> On Tue, Feb 11, 2025 at 10:34 AM Jingsong Li <jingsongl...@gmail.com>
> > >> wrote:
> > >>>
> > >>> Thanks Yong for driving this PIP!
> > >>>
> > >>>>> Currently, there will be a global committer node in a Flink Paimon
> > >>>>> job which is used to ensure the consistency of written data in
> > >>>>> Paimon. This committer node will connect all the tasks in the
> > >>>>> Flink job together, and all the tasks are within one region. As a
> > >>>>> result, when any task fails, it will trigger a global failover of
> > >>>>> the Flink job. We use HDFS as the remote storage, and we often
> > >>>>> encounter situations where a global failover of the job is
> > >>>>> triggered by write timeouts or errors when writing to HDFS, which
> > >>>>> causes quite a few stability issues.
> > >>>
> > >>> I know that Flink SinkV2 also commits through a regular node; does
> > >>> this mean that SinkV2 has the same drawback?
> > >>>
> > >>> Best,
> > >>> Jingsong
> > >>>
> > >>> On Mon, Feb 10, 2025 at 4:06 PM Yunfeng Zhou
> > >>> <flink.zhouyunf...@gmail.com> wrote:
> > >>>>
> > >>>> Hi Yong,
> > >>>>
> > >>>> The general idea looks good to me. Are there any statistics on the
> > >>>> number of operator events that need to be transmitted between the
> > >>>> coordinator and the writer operators? This information could help
> > >>>> estimate the additional workload on the JM, preventing the JM from
> > >>>> becoming a single bottleneck for the throughput of Paimon sinks.
> > >>>>
> > >>>> Best,
> > >>>> Yunfeng
> > >>>>
> > >>>>> On Jan 23, 2025, at 17:44, Yong Fang <zjur...@gmail.com> wrote:
> > >>>>>
> > >>>>> Hi devs,
> > >>>>>
> > >>>>> I would like to start a discussion about PIP-30: Improvement For
> > >>>>> Paimon Committer In Flink [1].
> > >>>>>
> > >>>>> Currently, Flink writes data to Paimon based on a Two-Phase
> > >>>>> Commit, which generates a global committer node and connects all
> > >>>>> tasks into one region. If any task fails, it will lead to a global
> > >>>>> failover of the Flink job.
> > >>>>>
> > >>>>> To solve this issue, we would like to introduce a Paimon Writer
> > >>>>> Coordinator to perform the table commit operation, enabling Flink
> > >>>>> Paimon jobs to support region failover and improving stability.
> > >>>>>
> > >>>>> Looking forward to hearing from you, thanks!
> > >>>>>
> > >>>>> [1]
> > >>>>> https://cwiki.apache.org/confluence/display/PAIMON/PIP-30%3A+Improvement+For+Paimon+Committer+In+Flink
> > >>>>>
> > >>>>>
> > >>>>> Best,
> > >>>>> Fang Yong
> > >>>>
> > >>
> >
>
