Thanks for launching this topic xiaogang!

I also heard of this requirement from users before and I agree it could bring 
benefits for some scenarios.
As we know, the fault tolerance is one of the biggest challenges in stream 
architecuture, because it is difficult to change if the initial system design 
is not fully considering it.

Flink already provides two basic failover strategies: 
Restart-all for pipelined mode which is assumed as light-weight if checkpoint 
could be done quickly to make small states restore during restarting.
Region-based for blocking mode which only needs to restart the taks within a 
region. 
In coming release-1.9, we made much efforts for FLIP-1 and parttition 
management for only restarting the failed tasks if the consumed partitiosn are 
still available in ideal condition.

If we want to further provide more ways for fault tolerance like at-most-once, 
we need to measure/tradeoff the efforts with benefits. So it might be better to 
give a detail design and measure how much efforts to be paid.
I have the similar concerns as Piotr and from my previous experience of 
failover improvment in alibaba, it involves in many big changes and touches 
many components. We ever made big efforts to adjust the network behavior
for this issue and still seems not very clean. Because atm if one task fails, 
the corresponding consumer/producer sides would also fail via network 
communication and releases the partition/gate completely.

Best,
Zhijiang
------------------------------------------------------------------
From:Zhu Zhu <reed...@gmail.com>
Send Time:2019年6月11日(星期二) 17:36
To:dev <dev@flink.apache.org>
Subject:Re: [DISCUSS] Allow at-most-once delivery in case of failures

Thanks Xiaogang for initiating the discussion. I think it is a very good
proposal.
We also received this requirements for Flink from Alibaba internal and
external customers.
In these cases, users are less concerned of the data consistency, but have
higher demands for low latency.

Here are a couple of things to consider:
1. "at-most-once"? or no guarantee?
   "at-most-once" semantics seems not to be necessary. Data loss and
duplication are accepted as long as the inconsistency is under certain
threshold.
   Data duplications still happens when failed task get recovered
individually. Extra de-dupe efforts are needed for "at-most-once".
2. Inconsistency measurement
   Although users are less concerned of the data consistency, too much data
inconsistency is not accepted as well.
   A measurement for data inconsistency is needed for monitoring and
alerting.
3. Auto recovery
  An auto recovery mechanism is needed to recover the job to a normal state
if the inconsistency goes beyond acceptable values.


Overall I think this individual failover mechanism would be very helpful in
some cases.
In Alibaba Blink, a best effort individual failover strategy is also added
for this purpose to support customers.






Zili Chen <wander4...@gmail.com> 于2019年6月11日周二 下午4:54写道:

> Hi Xiaogang,
>
> It is an interesting topic.
>
> Notice that there is some effort to build a mature mllib of flink these
> days, it could be also possible for some ml cases trade off correctness for
> timeliness or throughput. Excatly-once delivery excatly makes flink stand
> out but an at-most-once option would adapt flink to more scenarios.
>
> Best,
> tison.
>
>
> SHI Xiaogang <shixiaoga...@gmail.com> 于2019年6月11日周二 下午4:33写道:
>
> > Flink offers a fault-tolerance mechanism to guarantee at-least-once and
> > exactly-once message delivery in case of failures. The mechanism works
> well
> > in practice and makes Flink stand out among stream processing systems.
> >
> > But the guarantee on at-least-once and exactly-once delivery does not
> come
> > without price. It typically requires to restart multiple tasks and fall
> > back to the place where the last checkpoint is taken. (Fined-grained
> > recovery can help alleviate the cost, but it still needs certain efforts
> to
> > recover jobs.)
> >
> > In some senarios, users perfer quick recovery and will trade correctness
> > off. For example, in some online recommendation systems, timeliness is
> far
> > more important than consistency. In such cases, we can restart only those
> > failed tasks individually, and do not need to perform any rollback.
> Though
> > some messages delivered to failed tasks may be lost, other tasks can
> > continuously provide service to users.
> >
> > Many of our users are demanding for at-most-once delivery in Flink. What
> do
> > you think of the proposal? Any feedback is appreciated.
> >
> > Regards,
> > Xiaogang Shi
> >
>

Reply via email to