Sounds good to me. It is true that, if we are introducing the generalized watermark, there will be other watermark related concepts / configurations that need to be updated anyway.
Best, Xintong On Tue, Aug 15, 2023 at 11:30 AM Xuannan Su <suxuanna...@gmail.com> wrote: > Hi Xingtong, > > Thank you for your suggestion. > > After considering the idea of using a general configuration key, I think > it may not be a good idea for the reasons below. > > While I agree that using a more general configuration key provides us with > the flexibility to switch to other approaches to calculate the lag in the > future, the downside is that it may cause confusion for users. We currently > have fetchEventTimeLag, emitEventTimeLag, and watermarkLag in the source, > and it is not clear which specific lag we are referring to. With the > potential introduction of the Generalized Watermark mechanism in the > future, if I understand correctly, a watermark won't necessarily need to be > a timestamp. I am concern that the general configuration key may not be > enough to cover all the use case and we will need to introduce a general > way to determine the backlog status regardless. > > For the reasons above, I prefer introducing the configuration as is, and > change it later with the a deprecation process or migration process. What > do you think? > > Best, > Xuannan > On Aug 14, 2023, 14:09 +0800, Xintong Song <tonysong...@gmail.com>, wrote: > > Thanks for the explanation. > > > > I wonder if it makes sense to not expose this detail via the > configuration > > option. To be specific, I suggest not mentioning the "watermark" keyword > in > > the configuration key and description. > > > > - From the users' perspective, I think they only need to know there's a > > lag higher than the given threshold, Flink will consider latency of > > individual records as less important and prioritize throughput over it. > > They don't really need the details of how the lags are calculated. > > - For the internal implementation, I also think using watermark lags is > > a good idea, for the reasons you've already mentioned. However, it's not > > the only possible option. Hiding this detail from users would give us the > > flexibility to switch to other approaches if needed in future. > > - We are currently working on designing the ProcessFunction API > > (consider it as a DataStream API V2). There's an idea to introduce a > > Generalized Watermark mechanism, where basically the watermark can be > > anything that needs to travel along the data-flow with certain alignment > > strategies, and event time watermark would be one specific case of it. > This > > is still an idea and has not been discussed and agreed on by the > community, > > and we are preparing a FLIP for it. But if we are going for it, the > concept > > "watermark-lag-threshold" could be ambiguous. > > > > I do not intend to block the FLIP on this. I'd also be fine with > > introducing the configuration as is, and changing it later, if needed, > with > > a regular deprecation and migration process. Just making my suggestions. > > > > > > Best, > > > > Xintong > > > > > > > > On Mon, Aug 14, 2023 at 12:00 PM Xuannan Su <suxuanna...@gmail.com> > wrote: > > > > > Hi Xintong, > > > > > > Thanks for the reply. > > > > > > I have considered using the timestamp in the records to determine the > > > backlog status, and decided to use watermark at the end. By definition, > > > watermark is the time progress indication in the data stream. It > indicates > > > the stream’s event time has progressed to some specific time. On the > other > > > hand, timestamp in the records is usually used to generate the > watermark. > > > Therefore, it appears more appropriate and intuitive to calculate the > event > > > time lag by watermark and determine the backlog status. And by using > the > > > watermark, we can easily deal with the out-of-order and the idleness > of the > > > data. > > > > > > Please let me know if you have further questions. > > > > > > Best, > > > Xuannan > > > On Aug 10, 2023, 20:23 +0800, Xintong Song <tonysong...@gmail.com>, > wrote: > > > > Thanks for preparing the FLIP, Xuannan. > > > > > > > > +1 in general. > > > > > > > > A quick question, could you explain why we are relying on the > watermark > > > for > > > > emitting the record attribute? Why not use timestamps in the > records? I > > > > don't see any concern in using watermarks. Just wondering if there's > any > > > > deep considerations behind this. > > > > > > > > Best, > > > > > > > > Xintong > > > > > > > > > > > > > > > > On Thu, Aug 3, 2023 at 3:03 PM Xuannan Su <suxuanna...@gmail.com> > wrote: > > > > > > > > > Hi all, > > > > > > > > > > I am opening this thread to discuss FLIP-328: Allow source > operators to > > > > > determine isProcessingBacklog based on watermark lag[1]. We had a > > > several > > > > > discussions with Dong Ling about the design, and thanks for all the > > > > > valuable advice. > > > > > > > > > > The FLIP aims to target the use-case where user want to run a Flink > > > job to > > > > > backfill historical data in a high throughput manner and continue > > > > > processing real-time data with low latency. Building upon the > backlog > > > > > concept introduced in FLIP-309[2], this proposal enables sources to > > > report > > > > > their status of processing backlog based on the watermark lag. > > > > > > > > > > We would greatly appreciate any comments or feedback you may have > on > > > this > > > > > proposal. > > > > > > > > > > Best, > > > > > Xuannan > > > > > > > > > > > > > > > [1] > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-328%3A+Allow+source+operators+to+determine+isProcessingBacklog+based+on+watermark+lag > > > > > [2] > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-309%3A+Support+using+larger+checkpointing+interval+when+source+is+processing+backlog > > > > > > > > >