[
https://issues.apache.org/jira/browse/FLINK-20833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17259457#comment-17259457
]
Zhenqiu Huang commented on FLINK-20833:
---------------------------------------
[~trohrmann]
Thanks for the suggestion. As ExecutionFailureHandler is the central place to
handle errors, I think we can add it here. The change can be summarized as
follows:
1) Add an interface for the customizable failure classifier. We may name it
ExecutionFailureClassifier.
2) Add a DefaultExecutionFailureClassifier, which is basically a no-op
implementation.
3) Add a JobManagerOption that allows users to set the classifier class name;
the default value is DefaultExecutionFailureClassifier.
4) In the DefaultScheduler, we use the new JobManagerOption to initialize an
ExecutionFailureClassifier, and pass it into ExecutionFailureHandler.
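To make the four steps more concrete, here is a rough sketch of how the pieces
could fit together. The method name classify, its Map return type, and the
ExecutionFailureClassifierLoader helper are assumptions for illustration only.
The actual JobManagerOption key and the wiring into ExecutionFailureHandler are
not shown, since they are not decided yet.

```java
import java.util.Collections;
import java.util.Map;

/** Step 1: hypothetical pluggable interface; the method shape is an assumption. */
interface ExecutionFailureClassifier {
    /** Returns labels describing the failure; an empty map means "unclassified". */
    Map<String, String> classify(Throwable cause);
}

/** Step 2: the default implementation is a no-op that never assigns labels. */
class DefaultExecutionFailureClassifier implements ExecutionFailureClassifier {
    @Override
    public Map<String, String> classify(Throwable cause) {
        return Collections.emptyMap();
    }
}

/** Steps 3 and 4: hypothetical helper that instantiates the configured class by name. */
final class ExecutionFailureClassifierLoader {
    private ExecutionFailureClassifierLoader() {}

    static ExecutionFailureClassifier load(String className)
            throws ReflectiveOperationException {
        // asSubclass fails fast if the configured class does not implement the interface.
        Class<? extends ExecutionFailureClassifier> clazz =
                Class.forName(className).asSubclass(ExecutionFailureClassifier.class);
        // Requires a no-argument constructor on the configured class.
        return clazz.getDeclaredConstructor().newInstance();
    }
}
```

The scheduler would read the class name from the configuration, call
something like load(className) once, and hand the resulting instance to
ExecutionFailureHandler.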
After thinking more about the implementation, I feel that using a service
provider here is too heavy: we would need to put
DefaultExecutionFailureClassifier into the resources of the runtime module,
and users who want to override it would need to be able to exclude the default
one. What do you think?
> Expose pluggable interface for exception analysis and metrics reporting in
> Execution Graph
> -------------------------------------------------------------------------------------------
>
> Key: FLINK-20833
> URL: https://issues.apache.org/jira/browse/FLINK-20833
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Coordination
> Affects Versions: 1.12.0
> Reporter: Zhenqiu Huang
> Priority: Minor
>
> Platform users of Apache Flink usually want to classify the failure reason
> (for example user code, networking, dependencies, etc.) of Flink jobs and
> emit metrics for the classified results, so that the platform can provide an
> accurate measure of system reliability by distinguishing failures caused by
> user logic from system issues.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)