[
https://issues.apache.org/jira/browse/FLINK-10289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16625289#comment-16625289
]
ASF GitHub Bot commented on FLINK-10289:
----------------------------------------
isunjin opened a new pull request #6739: [FLINK-10289] [JobManager] Classify
Exceptions to different category for apply different failover strategy
URL: https://github.com/apache/flink/pull/6739
## What is the purpose of the change
We need to classify exceptions and treat them with different strategies. To
do this, we propose to introduce the following Throwable Types, and the
corresponding exceptions:
- NonRecoverable
  - We shouldn’t retry if an exception is classified as NonRecoverable.
  - For example, NoResourceAvailableException is a NonRecoverable exception.
  - Introduce a new exception, UserCodeException, to wrap all exceptions
    thrown from user code.
- PartitionDataMissingError
  - In certain scenarios, producer data is transferred in blocking mode or
    saved in a persistent store. If the partition is missing, we need to
    revoke/rerun the producer task to regenerate the data.
  - Introduce a new exception, PartitionDataMissingException, to wrap all
    issues of this kind.
- EnvironmentError
  - Caused by hardware or software issues tied to a specific environment.
    The assumption is that the task will succeed if we run it in a different
    environment, and that other tasks run in the bad environment will very
    likely fail. If multiple tasks fail on the same machine due to
    EnvironmentError, we should consider adding the machine to a blacklist
    and avoid scheduling tasks on it.
  - Introduce a new exception, EnvironmentException, to wrap all issues of
    this kind.
- Recoverable
  - We assume all other issues are recoverable.
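
To make the proposed categories concrete, here is a minimal sketch of how such a classification could look. It is an illustration only, not the code in this pull request: the class `FailureClassificationSketch`, the enum constants, the stand-in exception classes, the `HostFailureTracker` helper, and the threshold parameter are all hypothetical names chosen for the example. It assumes that the outermost matching exception in the cause chain decides the category, and that anything unmatched is Recoverable.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; these names are assumptions, not Flink classes. */
final class FailureClassificationSketch {

    /** The four categories listed in the description above. */
    enum ThrowableType {
        NON_RECOVERABLE, PARTITION_DATA_MISSING_ERROR, ENVIRONMENT_ERROR, RECOVERABLE
    }

    // Stand-ins for the wrapper exceptions the description proposes.
    static class UserCodeException extends RuntimeException {
        UserCodeException(Throwable cause) { super(cause); }
    }

    static class PartitionDataMissingException extends RuntimeException {
        PartitionDataMissingException(String message) { super(message); }
    }

    static class EnvironmentException extends RuntimeException {
        EnvironmentException(String message) { super(message); }
    }

    /** Walks the cause chain; the outermost recognized wrapper wins. */
    static ThrowableType classify(Throwable failure) {
        int depth = 0; // guard against pathological cyclic cause chains
        for (Throwable t = failure; t != null && depth < 32; t = t.getCause(), depth++) {
            if (t instanceof UserCodeException) {
                return ThrowableType.NON_RECOVERABLE;
            }
            if (t instanceof PartitionDataMissingException) {
                return ThrowableType.PARTITION_DATA_MISSING_ERROR;
            }
            if (t instanceof EnvironmentException) {
                return ThrowableType.ENVIRONMENT_ERROR;
            }
        }
        // "We assume all other issues are recoverable."
        return ThrowableType.RECOVERABLE;
    }

    /** Counts EnvironmentError failures per host to find blacklist candidates. */
    static final class HostFailureTracker {
        private final int threshold; // assumed tuning knob, not from the source
        private final Map<String, Integer> environmentFailures = new HashMap<>();

        HostFailureTracker(int threshold) {
            this.threshold = threshold;
        }

        /** Records a failure; returns true if the host is now a blacklist candidate. */
        boolean recordFailure(String host, Throwable failure) {
            if (classify(failure) == ThrowableType.ENVIRONMENT_ERROR) {
                environmentFailures.merge(host, 1, Integer::sum);
            }
            return isBlacklistCandidate(host);
        }

        boolean isBlacklistCandidate(String host) {
            return environmentFailures.getOrDefault(host, 0) >= threshold;
        }
    }
}
```

A failover strategy could then branch on the result of `classify`: fail the job for `NON_RECOVERABLE`, rerun the producer for `PARTITION_DATA_MISSING_ERROR`, and feed `ENVIRONMENT_ERROR` failures into something like `HostFailureTracker` before rescheduling elsewhere.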
## Brief change log
- *Add exception types*
- *Add a class to classify exceptions*
- *Unit tests*
## Verifying this change
This change added tests and can be verified as follows:
- *Added a test that validates that SuppressRestartsException is classified
as a NonRecoverable exception*
## Does this pull request potentially affect one of the following parts:
- Dependencies (does it add or upgrade a dependency): (*no*)
- The public API, i.e., is any changed class annotated with
`@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its
components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)
- The S3 file system connector: (no)
## Documentation
- Does this pull request introduce a new feature? (yes)
- If yes, how is the feature documented? In a design document
[(here)](https://docs.google.com/document/d/1FdZdcA63tPUEewcCimTFy9Iz2jlVlMRANZkO4RngIuk/edit?spm=a1zcr.8293797.0.0.3b116385Btb5sf)
> Classify Exceptions to different category for apply different failover
> strategy
> -------------------------------------------------------------------------------
>
> Key: FLINK-10289
> URL: https://issues.apache.org/jira/browse/FLINK-10289
> Project: Flink
> Issue Type: Sub-task
> Components: JobManager
> Reporter: JIN SUN
> Assignee: JIN SUN
> Priority: Major
> Labels: pull-request-available
>
> We need to classify exceptions and treat them with different strategies. To
> do this, we propose to introduce the following Throwable Types, and the
> corresponding exceptions:
> * NonRecoverable
> ** We shouldn’t retry if an exception is classified as NonRecoverable.
> ** For example, NoResourceAvailableException is a NonRecoverable exception.
> ** Introduce a new exception, UserCodeException, to wrap all exceptions
> thrown from user code.
> * PartitionDataMissingError
> ** In certain scenarios, producer data is transferred in blocking mode or
> saved in a persistent store. If the partition is missing, we need to
> revoke/rerun the producer task to regenerate the data.
> ** Introduce a new exception, PartitionDataMissingException, to wrap all
> issues of this kind.
> * EnvironmentError
> ** Caused by hardware or software issues tied to a specific environment.
> The assumption is that the task will succeed if we run it in a different
> environment, and that other tasks run in the bad environment will very
> likely fail. If multiple tasks fail on the same machine due to
> EnvironmentError, we should consider adding the machine to a blacklist
> and avoid scheduling tasks on it.
> ** Introduce a new exception, EnvironmentException, to wrap all issues of
> this kind.
> * Recoverable
> ** We assume all other issues are recoverable.