[jira] [Commented] (FLINK-20833) Expose pluggable interface for exception analysis and metrics reporting in Execution Graph

Zhenqiu Huang (Jira) Tue, 05 Jan 2021 23:04:36 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259457#comment-17259457
 ]


Zhenqiu Huang commented on FLINK-20833:
---------------------------------------

[~trohrmann]
Thanks for the suggestion. Since ExecutionFailureHandler is the central place 
where errors are handled, I think we can add it there. The change can be 
summarized as follows:

1) Add an interface for the customizable failure classifier. We may name it 
ExecutionFailureClassifier.
2) Add a DefaultExecutionFailureClassifier, which is basically a no-op 
implementation.
3) Add a JobManagerOption that lets users set the classifier class name; the 
default value is DefaultExecutionFailureClassifier.
4) In the DefaultScheduler, use the new JobManagerOption to initialize an 
ExecutionFailureClassifier and pass it into ExecutionFailureHandler.
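
The four steps above could look roughly like the following minimal sketch. 
Everything beyond the two classifier names is an assumption made for 
illustration: the classify method and its String result, the loader class, 
and the use of a plain class-name string in place of a real Flink 
configuration option.

```java
/** Step 1: hypothetical plugin interface; the method shape is an assumption. */
interface ExecutionFailureClassifier {
    /** Returns a short label for the failure cause, e.g. "USER" or "SYSTEM". */
    String classify(Throwable cause);
}

/** Step 2: default implementation that performs no analysis. */
class DefaultExecutionFailureClassifier implements ExecutionFailureClassifier {
    @Override
    public String classify(Throwable cause) {
        return "UNKNOWN";
    }
}

/** Steps 3 and 4: create the classifier from a configured class name (hypothetical helper). */
final class ExecutionFailureClassifierLoader {
    static final String DEFAULT_CLASS_NAME = DefaultExecutionFailureClassifier.class.getName();

    /** Falls back to the default classifier when no class name is configured. */
    static ExecutionFailureClassifier load(String className) {
        String name = (className == null || className.isEmpty()) ? DEFAULT_CLASS_NAME : className;
        try {
            // The configured class needs an accessible no-argument constructor.
            return Class.forName(name)
                    .asSubclass(ExecutionFailureClassifier.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot instantiate failure classifier: " + name, e);
        }
    }
}
```

In this sketch the scheduler would call ExecutionFailureClassifierLoader.load 
with the configured value once and hand the resulting instance to 
ExecutionFailureHandler.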

After thinking more about the implementation, I feel that using a service 
provider here is too heavy, because we would need to put 
DefaultExecutionFailureClassifier into the resources of the runtime module. 
Users who want to override it would then need a way to exclude the default 
one. What do you think?
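
To illustrate the concern, here is a rough sketch of service-provider 
discovery with java.util.ServiceLoader. The ProviderDiscovery helper and its 
method names are hypothetical. Because ServiceLoader returns every provider 
registered under META-INF/services, a default registered by the runtime 
module would show up next to any user-supplied one, and the caller would 
have to disambiguate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

/** Hypothetical helper showing how providers are discovered via ServiceLoader. */
final class ProviderDiscovery {

    /** Returns every implementation of the service type visible to ServiceLoader. */
    static <T> List<T> discover(Class<T> serviceType) {
        List<T> found = new ArrayList<>();
        for (T provider : ServiceLoader.load(serviceType)) {
            found.add(provider);
        }
        return found;
    }

    /** Returns the single registered provider, failing when there is none or more than one. */
    static <T> T selectOnly(Class<T> serviceType) {
        List<T> found = discover(serviceType);
        if (found.size() != 1) {
            throw new IllegalStateException(
                    "Expected exactly one " + serviceType.getName()
                            + " provider, found " + found.size());
        }
        return found.get(0);
    }
}
```

With a class-name option, by contrast, exactly one implementation is chosen 
and nothing has to be excluded.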



> Expose pluggable interface for exception analysis and metrics reporting in 
> Execution Graph
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-20833
>                 URL: https://issues.apache.org/jira/browse/FLINK-20833
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Zhenqiu Huang
>            Priority: Minor
>
> For platform users of Apache Flink, people usually want to classify the 
> failure reason (for example user code, networking, dependencies, etc.) for 
> Flink jobs and emit metrics for the analyzed results, so that the platform 
> can provide an accurate value for system reliability by distinguishing 
> failures caused by user logic from system issues.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
