Sounds good to me.

It is true that, if we are introducing the generalized watermark, there
will be other watermark related concepts / configurations that need to be
updated anyway.


Best,

Xintong



On Tue, Aug 15, 2023 at 11:30 AM Xuannan Su <suxuanna...@gmail.com> wrote:

> Hi Xingtong,
>
> Thank you for your suggestion.
>
> After considering the idea of using a general configuration key, I think
> it may not be a good idea for the reasons below.
>
> While I agree that using a more general configuration key provides us with
> the flexibility to switch to other approaches to calculate the lag in the
> future, the downside is that it may cause confusion for users. We currently
> have fetchEventTimeLag, emitEventTimeLag, and watermarkLag in the source,
> and it is not clear which specific lag we are referring to. With the
> potential introduction of the Generalized Watermark mechanism in the
> future, if I understand correctly, a watermark won't necessarily need to be
> a timestamp. I am concern that the general configuration key may not  be
> enough to cover all the use case and we will need to introduce a general
> way to determine the backlog status regardless.
>
> For the reasons above, I prefer introducing the configuration as is, and
> change it later with the a deprecation process or migration process. What
> do you think?
>
> Best,
> Xuannan
> On Aug 14, 2023, 14:09 +0800, Xintong Song <tonysong...@gmail.com>, wrote:
> > Thanks for the explanation.
> >
> > I wonder if it makes sense to not expose this detail via the
> configuration
> > option. To be specific, I suggest not mentioning the "watermark" keyword
> in
> > the configuration key and description.
> >
> > - From the users' perspective, I think they only need to know there's a
> > lag higher than the given threshold, Flink will consider latency of
> > individual records as less important and prioritize throughput over it.
> > They don't really need the details of how the lags are calculated.
> > - For the internal implementation, I also think using watermark lags is
> > a good idea, for the reasons you've already mentioned. However, it's not
> > the only possible option. Hiding this detail from users would give us the
> > flexibility to switch to other approaches if needed in future.
> > - We are currently working on designing the ProcessFunction API
> > (consider it as a DataStream API V2). There's an idea to introduce a
> > Generalized Watermark mechanism, where basically the watermark can be
> > anything that needs to travel along the data-flow with certain alignment
> > strategies, and event time watermark would be one specific case of it.
> This
> > is still an idea and has not been discussed and agreed on by the
> community,
> > and we are preparing a FLIP for it. But if we are going for it, the
> concept
> > "watermark-lag-threshold" could be ambiguous.
> >
> > I do not intend to block the FLIP on this. I'd also be fine with
> > introducing the configuration as is, and changing it later, if needed,
> with
> > a regular deprecation and migration process. Just making my suggestions.
> >
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Mon, Aug 14, 2023 at 12:00 PM Xuannan Su <suxuanna...@gmail.com>
> wrote:
> >
> > > Hi Xintong,
> > >
> > > Thanks for the reply.
> > >
> > > I have considered using the timestamp in the records to determine the
> > > backlog status, and decided to use watermark at the end. By definition,
> > > watermark is the time progress indication in the data stream. It
> indicates
> > > the stream’s event time has progressed to some specific time. On the
> other
> > > hand, timestamp in the records is usually used to generate the
> watermark.
> > > Therefore, it appears more appropriate and intuitive to calculate the
> event
> > > time lag by watermark and determine the backlog status. And by using
> the
> > > watermark, we can easily deal with the out-of-order and the idleness
> of the
> > > data.
> > >
> > > Please let me know if you have further questions.
> > >
> > > Best,
> > > Xuannan
> > > On Aug 10, 2023, 20:23 +0800, Xintong Song <tonysong...@gmail.com>,
> wrote:
> > > > Thanks for preparing the FLIP, Xuannan.
> > > >
> > > > +1 in general.
> > > >
> > > > A quick question, could you explain why we are relying on the
> watermark
> > > for
> > > > emitting the record attribute? Why not use timestamps in the
> records? I
> > > > don't see any concern in using watermarks. Just wondering if there's
> any
> > > > deep considerations behind this.
> > > >
> > > > Best,
> > > >
> > > > Xintong
> > > >
> > > >
> > > >
> > > > On Thu, Aug 3, 2023 at 3:03 PM Xuannan Su <suxuanna...@gmail.com>
> wrote:
> > > >
> > > > > Hi all,
> > > > >
> > > > > I am opening this thread to discuss FLIP-328: Allow source
> operators to
> > > > > determine isProcessingBacklog based on watermark lag[1]. We had a
> > > several
> > > > > discussions with Dong Ling about the design, and thanks for all the
> > > > > valuable advice.
> > > > >
> > > > > The FLIP aims to target the use-case where user want to run a Flink
> > > job to
> > > > > backfill historical data in a high throughput manner and continue
> > > > > processing real-time data with low latency. Building upon the
> backlog
> > > > > concept introduced in FLIP-309[2], this proposal enables sources to
> > > report
> > > > > their status of processing backlog based on the watermark lag.
> > > > >
> > > > > We would greatly appreciate any comments or feedback you may have
> on
> > > this
> > > > > proposal.
> > > > >
> > > > > Best,
> > > > > Xuannan
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-328%3A+Allow+source+operators+to+determine+isProcessingBacklog+based+on+watermark+lag
> > > > > [2]
> > > > >
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-309%3A+Support+using+larger+checkpointing+interval+when+source+is+processing+backlog
> > > > >
> > >
>

Reply via email to