Re: How do we approach Fault Tolerance in Apache Edgent

Gayashan Amarasinghe Sun, 18 Nov 2018 05:04:36 -0800

Hi Julian,

Thank you for the detailed response, and apologies for the delay. Please find
my reply below and let me know what you think.


On Fri, Nov 9, 2018 at 8:42 AM Julian Feinauer <j.feina...@pragmaticminds.de>
wrote:

> Hi Gayashan,
>
>
>
> first, thanks for bringing your ideas to the list.
>
> I know that fault tolerance is very important in many stream processing
> applications and indeed we also have some failure handling in our
> application.
>
> And I agree with you that it doesn't come without a performance penalty
> (which is usually fine at the edge, as in our cases our machines are "big
> enough").
>

I think the performance penalty could become a concern on some edge
devices, but as long as it is configurable, I don't think we should
worry about it. There is also no way to remove this penalty entirely; we can
only reduce its impact.


>
>
>
> But I'm unsure what kind of failure handling is needed. At the edge (in
> contrast to the cloud) you have only a single instance running. So whenever
> the gateway (or device) dies, you have no fallback to switch over to. And
> usually that's fine, because you have many single points of failure in edge
> applications (network, power supply, the device itself, …).
>
>
So would you agree that, in the type of applications Edgent is concerned
with, intermittent data loss due to failures is acceptable, as long as the
edge device and the application can detect and recover from the failure?
This would be the easier case for a stateless operator, but for a stateful
operator the data loss will be a problem.


>
>
> So what I think is really important is to be safe against bugs or code
> changes / updates. Therefore, for example, we never process a stream
> directly but route it through a "buffer" (think of Kafka, but small) to
> enable "backpressure" and especially restartability of the processing
> engine or (partial) re-processing.
>

I think what you mean here is a source preservation mechanism for the
failure case, so that the part of the stream that was not processed because
of the failure can be replayed (and reprocessed by the downstream
operators). This is required if we are concerned about data loss. But it has
a high impact on latency, so for latency-critical applications I don't think
it is an ideal solution.
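
To make the replay idea concrete, here is a minimal sketch in plain Java of
what I have in mind. It is not an Edgent API; the class name ReplayBuffer and
its methods are hypothetical, and I am assuming that downstream processing
acknowledges tuples by sequence number. The source keeps each tuple until it
is acknowledged, refuses new tuples when full (a crude backpressure signal),
and hands back the unacknowledged tuples after a restart.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded source buffer: keeps tuples until they are acknowledged. */
class ReplayBuffer<T> {
    private static final class Entry<T> {
        final long seq;
        final T tuple;
        Entry(long seq, T tuple) { this.seq = seq; this.tuple = tuple; }
    }

    private final Deque<Entry<T>> pending = new ArrayDeque<>();
    private final int capacity;
    private long nextSeq = 0;

    ReplayBuffer(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("capacity must be positive");
        this.capacity = capacity;
    }

    /** Stores a tuple and returns its sequence number, or -1 when the buffer is full. */
    synchronized long offer(T tuple) {
        if (pending.size() >= capacity) return -1; // caller should slow down or drop
        long seq = nextSeq++;
        pending.addLast(new Entry<>(seq, tuple));
        return seq;
    }

    /** Releases every tuple whose sequence number is less than or equal to seq. */
    synchronized void ack(long seq) {
        while (!pending.isEmpty() && pending.peekFirst().seq <= seq) {
            pending.removeFirst();
        }
    }

    /** Tuples that must be sent again after a restart, oldest first. */
    synchronized List<T> unacknowledged() {
        List<T> out = new ArrayList<>();
        for (Entry<T> e : pending) out.add(e.tuple);
        return out;
    }
}
```

Replaying unacknowledged tuples gives at-least-once delivery, so downstream
operators may see duplicates. The latency cost I mentioned comes from the
extra hop through the buffer and from waiting on acknowledgements.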


>
>
>
> What I would like to have is something like the operator or partition
> state from Apache Flink [1] to allow your internal state to be "kept" (to
> reproduce it from checkpoints) whenever problems occur.
>

I read through the Flink paper, and the checkpointing mechanism
(Asynchronous Barrier Snapshotting) for internal state sounds similar to
what's done in an older streaming paper -- Mobistreams [1]. I agree that it
would make sense to have that. But the problem is that Edgent doesn't have a
centralised coordinator to trigger the injection of checkpoint tokens into
the stream. This matters especially in a topology with multiple sources, to
ensure consistency.
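
To illustrate why multiple sources make this harder, here is a rough sketch
in plain Java of the barrier alignment step as I understand it from the Flink
paper. It is not Edgent or Flink code; the class AligningSum and its methods
are hypothetical, and I am assuming only one checkpoint is in flight at a
time. Once a token (barrier) arrives on one input, tuples from that input are
held back until the token has arrived on every input; only then is the state
snapshotted.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical stateful operator (running sum) that aligns barriers across inputs. */
class AligningSum {
    private final int inputs;
    private final Set<Integer> blocked = new HashSet<>(); // inputs whose barrier has arrived
    private final List<Long> held = new ArrayList<>();    // post-barrier tuples held back
    private final List<Long> snapshots = new ArrayList<>();
    private long sum = 0;

    AligningSum(int inputs) { this.inputs = inputs; }

    /** A tuple on a blocked input belongs after the snapshot, so it is held back. */
    void onTuple(int input, long value) {
        if (blocked.contains(input)) {
            held.add(value);
        } else {
            sum += value;
        }
    }

    /** Blocks the input; once every input is blocked, snapshots and resumes. */
    void onBarrier(int input) {
        blocked.add(input);
        if (blocked.size() == inputs) {
            snapshots.add(sum);               // state covers exactly the pre-barrier tuples
            blocked.clear();
            for (long v : held) sum += v;     // now process the tuples that were held back
            held.clear();
        }
    }

    long sum() { return sum; }

    List<Long> snapshots() { return snapshots; }
}
```

The sketch leaves open who calls onBarrier on each source at the "same"
logical point, which is exactly the coordinator role that is missing today.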


> In our industry applications we usually also have at-least-once situations
> or in many cases even idempotent operations where we can live fine with
> at-least-once guarantees, which makes things way more comfortable.
>
>
Just to get an idea, can you give me high-level details about a few
applications that you have seen out there? And in such applications, which
is the bigger concern in your opinion: increasing throughput or
decreasing latency?

Thank you.

[1] https://dspace.mit.edu/openaccess-disseminate/1721.1/100987

Best,
Gayashan


>
>
> Best
>
> Julian
>
>
>
> ________________________________
> From: Gayashan Amarasinghe <gayashan.amarasin...@gmail.com>
> Sent: Wednesday, November 7, 2018 1:24:41 AM
> To: dev@edgent.apache.org
> Subject: How do we approach Fault Tolerance in Apache Edgent
>
> Hi all,
>
> Fault tolerance is a wide subject that can be approached in many ways (but
> maybe not fully achieved without a performance degradation?). With the
> edge devices that we are focusing on in Edgent, I think faults are very much
> an expected phenomenon. There could be node failures, network failures,
> software bugs or exceptions in the code, limited resources, etc. And
> therefore some form of fault tolerance could be beneficial for some use
> cases.
>
> When it comes to fault tolerance, there seem to be two main concepts:
> replication and checkpointing. They both have their pros and cons. And with
> fault tolerance there is the requirement of recovery. And there are many
> ways of recovering from a failure as well.
>
> But before going into that I thought I would ask the list: what are
> your thoughts on this? Do you think fault tolerance is required from a real
> user/industry perspective? Do you have experiences where some form of fault
> tolerance should have been implemented? Let's have a discussion on this if
> possible.
>
> (Special thanks should go to Julian for prompting this discussion on
> another thread.)
>
> /Gayashan
>
