Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

David Morávek Mon, 20 Mar 2023 04:04:31 -0700

>
> however listeners can use previous state (tags/labels) to make decisions



That sounds like a very fragile contract. We should either allow passing
tags between listeners and then need to define ordering or make all of them
independent. I prefer the latter because it allows us to parallelize things
if needed (if all listeners trigger an RCP to the external system, for
example).

Can you expand on why we need more than one classifier to be able to output
the same tag?

system ones come first and then the ones loaded from the plugin manager


Since they're returned as a Set, the order is completely non-deterministic,
no matter in which order they're loaded.

just communicating with external monitoring/alerting systems
>

That makes the need for pushing things out of the main thread even
stronger. This almost sounds like we need to return a CompletableFuture for
the per-throwable classification because an external system might take a
significant time to respond. We need to unblock the main thread for other
RPCs.

Also, in the proposal, this happens in the failure handler. If that's the
case, this might block the job from being restarted (if the restart
strategy allows for another restart), which would be great to avoid because
it can introduce extra downtime.

This raises another question: what should happen if the classification
fails? Crashing the job (which is what's currently proposed) seems very
dangerous if this might depend on an external system.

Thats a valid point, passing the JobGraph containing all the above
> information is also something to consider
>

We should avoid passing JG around because it's mutable (which we must fix
in the long term), and letting users change it might have consequences.

Best,
D.

On Mon, Mar 20, 2023 at 7:23 AM Panagiotis Garefalakis <[email protected]>
wrote:

> Hey David, Shammon,
>
> Thanks for the valuable comments!
> I am glad you find this proposal useful, some thoughts:
>
> @Shammon
>
> 1. How about adding more job information in FailureListenerContext? For
> > example, job vertext, subtask, taskmanager location. And then user can do
> > more statistics according to different dimensions.
>
>
> Thats a valid point, passing the JobGraph containing all the above
> information
> is also something to consider, I was mostly trying to be conservative:
> i.e., passingly only the information we need, and extend as we see fit
>
> 2. Users may want to save results in listener, and then they can get the
> > historical results even jabmanager failover. Can we provide a unified
> > implementation for data storage requirements?
>
>
> The idea is to store only the output of the Listeners (tags) and treat them
> as stateless.
> Tags are be stored along with HistoryEntries, and will be available through
> the HistoryServer
> even after a JM dies.
>
> @David
>
> 1) Should we also consider adding labels? The combination of tags and
> > labels seems to be what most systems offer; sometimes, they offer labels
> > only (key=value pairs) because tags can be implemented using those, but
> not
> > the other way around.
>
>
> Indeed changing tags to k:v labels could be more expressive, I like it!
> Let's see what others think.
>
> 2) Since we can not predict how heavy user-defined models ("listeners") are
> > going to be, it would be great to keep the interfaces/data structures
> > immutable so we can push things over to the I/O threads. Also, it sounds
> > off to call the main interface a Listener since it's supposed to enhance
> > the original throwable with additional metadata.
>
>
> The idea was for the name to be generic as there could be Listener
> implementations
> just communicating with external monitoring/alerting systems and no
> metadata output
> -- but lets rethink that. For immutability, see below:
>
> 3) You're proposing to support a set of listeners. Since you're passing the
> > mutable context around, which includes tags set by the previous listener,
> > do you expect users to make any assumptions about the order in which
> > listeners are executed?
>
>
> In the existing proposal we are not making any assumptions about the order
> of listeners,
> (system ones come first and then the ones loaded from the plugin manager)
> however listeners can use previous state (tags/labels) to make decisions:
> e.g., wont assign *UNKNOWN* failureType when we have already seen *USER *or
> the other way around -- when we have seen *UNKNOWN* remove in favor of
> *USER*
>
>
> Cheers,
> Panagiotis
>
> On Sun, Mar 19, 2023 at 10:42 AM David Morávek <[email protected]> wrote:
>
> > Hi Panagiotis,
> >
> > This is an excellent proposal and something everyone trying to provide
> > "Flink as a service" needs to solve at some point. I have a couple of
> > questions:
> >
> > If I understand the proposal correctly, this is just about adding tags to
> > the Throwable by running a tuple of (Throwable, FailureContext) through a
> > user-defined model.
> >
> > 1) Should we also consider adding labels? The combination of tags and
> > labels seems to be what most systems offer; sometimes, they offer labels
> > only (key=value pairs) because tags can be implemented using those, but
> not
> > the other way around.
> >
> > 2) Since we can not predict how heavy user-defined models ("listeners")
> are
> > going to be, it would be great to keep the interfaces/data structures
> > immutable so we can push things over to the I/O threads. Also, it sounds
> > off to call the main interface a Listener since it's supposed to enhance
> > the original throwable with additional metadata.
> >
> > I'd propose something along the lines of (we should have better names,
> this
> > is just to outline the idea):
> >
> > interface FailureEnricher {
> >
> >   ThrowableWithTagsAndLabels enrichFailure(Throwable cause,
> > ImmutableContextualMetadataAboutTheThrowable context);
> > }
> >
> > The names should change; this is just to outline the idea.
> >
> > 3) You're proposing to support a set of listeners. Since you're passing
> the
> > mutable context around, which includes tags set by the previous listener,
> > do you expect users to make any assumptions about the order in which
> > listeners are executed?
> >
> > *@Shammon*
> >
> > Users may want to save results in listener, and then they can get the
> > > historical results even jabmanager failover. Can we provide a unified
> > > implementation for data storage requirements?
> >
> >
> > I think we should explicitly state that all "listeners" are treated as
> > stateless. I don't see any strong reason for snapshotting them.
> >
> > Best,
> > D.
> >
> > On Sat, Mar 18, 2023 at 1:00 AM Shammon FY <[email protected]> wrote:
> >
> > > Hi Panagiotis
> > >
> > > Thank you for starting this discussion. I think this FLIP is valuable
> and
> > > can help user to analyze the causes of job failover better!
> > >
> > > I have two comments as follows
> > >
> > > 1. How about adding more job information in FailureListenerContext? For
> > > example, job vertext, subtask, taskmanager location. And then user can
> do
> > > more statistics according to different dimensions.
> > >
> > > 2. Users may want to save results in listener, and then they can get
> the
> > > historical results even jabmanager failover. Can we provide a unified
> > > implementation for data storage requirements?
> > >
> > >
> > > Best,
> > > shammon FY
> > >
> > >
> > > On Saturday, March 18, 2023, Panagiotis Garefalakis <[email protected]
> >
> > > wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > This FLIP [1] proposes a pluggable interface for failure handling
> > > allowing
> > > > users to implement custom failure logic using the plugin framework.
> > > > Motivated by existing proposals [2] and tickets [3], this enables
> > > use-cases
> > > > like: assigning particular types to failures (e.g., User or System),
> > > > emitting custom metrics per type (e.g., application or platform),
> even
> > > > exposing errors to downstream consumers (e.g., notification systems).
> > > >
> > > > Thanks to Piotr and Anton for the initial reviews and discussions!
> > > >
> > > > For anyone interested, the starting point would be the FLIP [1] that
> I
> > > > created,
> > > > describing the motivation and the proposed changes (part of the core,
> > > > runtime and web).
> > > >
> > > > The intuition behind this FLIP is being able to execute custom logic
> on
> > > > failures by exposing a FailureListener interface. Implementation by
> > users
> > > > can be simply loaded to the system as Jar files. FailureListeners may
> > > also
> > > > decide to assign failure tags to errors (expressed as strings),
> > > > that will then be exposed as metadata by the UI/Rest interfaces.
> > > >
> > > > Feedback is always appreciated! Looking forward to your thoughts!
> > > >
> > > > [1]
> > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-304%
> > > > 3A+Pluggable+failure+handling+for+Apache+Flink
> > > > [2]
> > > > https://docs.google.com/document/d/1pcHg9F3GoDDeVD5GIIo2wO67
> > > > Hmjgy0-hRDeuFnrMgT4
> > > > [3] https://issues.apache.org/jira/browse/FLINK-20833
> > > >
> > > > Cheers,
> > > > Panagiotis
> > > >
> > >
> >
>

Re: [DISCUSS] FLIP-304: Pluggable failure handling for Apache Flink

Reply via email to