[
https://issues.apache.org/jira/browse/AIRFLOW-3822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762631#comment-16762631
]
ASF GitHub Bot commented on AIRFLOW-3822:
-----------------------------------------
dmateusp commented on pull request #4663: [AIRFLOW-3822] Delete
KubernetesPodOperator pod on kill
URL: https://github.com/apache/airflow/pull/4663
Make sure you have checked _all_ steps below.
### Jira
> - [X] My PR addresses the following [Airflow
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
- https://issues.apache.org/jira/browse/AIRFLOW-XXX
- In case you are fixing a typo in the documentation you can prepend your
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
https://issues.apache.org/jira/browse/AIRFLOW-3822
### Description
> - [X] Here are some details about my PR, including screenshots of any UI
changes:
This PR corrects a behavior observed with the KubernetesExecutor +
KubernetesPodOperator, where a timeout kills the watcher pod but not the pod
running the actual work.
* I added an `on_kill` hook which removes the pod when the watcher pod is
terminated.
* I changed `pod` to `self.pod` in order to reference it in the `on_kill`
function.
* I removed the general catch of AirflowException, because its logging message
added no debugging insight and it was upcasting a TimeoutException, which
prevented the `on_kill` hook from being triggered.
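As a rough illustration of the three changes above (not the actual diff), the sketch below uses hypothetical stand-ins: `FakeLauncher`, `PodOperatorSketch`, `TaskTimeout` and `run_task` are invented names for this example and are not Airflow or Kubernetes APIs. It assumes the task runner calls `on_kill` only when it sees the original timeout type, which is why wrapping the timeout in a more general exception would skip the cleanup.

```python
class TaskTimeout(Exception):
    """Hypothetical stand-in for the timeout raised while execute() runs."""


class FakeLauncher:
    """Hypothetical stand-in for a pod launcher; it only records calls."""

    def __init__(self):
        self.deleted = []

    def run_pod(self, pod):
        # Simulate the watcher timing out while the worker pod still runs.
        raise TaskTimeout("timed out waiting for " + pod)

    def delete_pod(self, pod):
        self.deleted.append(pod)


class PodOperatorSketch:
    """Hypothetical operator showing the shape of the change."""

    def __init__(self, launcher, pod_name):
        self.launcher = launcher
        self.pod_name = pod_name
        self.pod = None

    def execute(self):
        # Keep the pod on self so that on_kill can reach it later.
        self.pod = self.pod_name
        # No blanket "except: raise SomeGenericError(...)" here, so the
        # timeout propagates with its original type.
        self.launcher.run_pod(self.pod)

    def on_kill(self):
        # Remove the worker pod if one was started.
        if self.pod is not None:
            self.launcher.delete_pod(self.pod)


def run_task(operator):
    """Hypothetical runner: triggers on_kill only for the timeout type."""
    try:
        operator.execute()
    except TaskTimeout:
        operator.on_kill()
        raise
```

With these stand-ins, calling `run_task(PodOperatorSketch(FakeLauncher(), "worker-pod"))` re-raises the timeout after the launcher has recorded a delete for `"worker-pod"`.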
### Tests
> - [X] My PR adds the following unit tests __OR__ does not need testing for
this extremely good reason:
For a test to make sense, it would need to run from within a Kubernetes
cluster. I tried building on top of the Docker image given in Contributing,
adding postgres resources to be launched in a local Kubernetes cluster,
mounting the Airflow repo as a hostVolume, and running `pip install -e`.
However, the started KubernetesPods also need a built image in order to run,
so I hit a wall there. (Even before hitting that wall, I had already written
quite a big README.)
Maybe we can have a chat about how to test this properly; I didn't see any
test in the codebase using `in_cluster=True` so far.
Here's what I have attempted to do:
https://github.com/dmateusp/incubator-airflow/tree/AIRFLOW-3822_test
### Commits
> - [X] My commits all reference Jira issues in their subject lines, and I
have squashed multiple commits if they address the same issue. In addition, my
commits follow the guidelines from "[How to write a good git commit
message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
1. Subject is limited to 50 characters (not including Jira issue reference)
1. Subject does not end with a period
1. Subject uses the imperative mood ("add", not "adding")
1. Body wraps at 72 characters
1. Body explains "what" and "why", not "how"
### Documentation
> - [X] In case of new functionality, my PR adds documentation that
describes how to use it.
- When adding new operators/hooks/sensors, the autoclass documentation
generation needs to be added.
- All the public functions and the classes in the PR contain docstrings
that explain what it does
This PR corrects a bug rather than adding a new feature.
### Code Quality
> - [X] Passes `flake8`
----------------------------------------------------------------
> Airflow on Kubernetes, KubernetesPodOperator doesn't stop a task after timeout
> ------------------------------------------------------------------------------
>
> Key: AIRFLOW-3822
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3822
> Project: Apache Airflow
> Issue Type: Bug
> Reporter: Daniel Mateus Pires
> Assignee: Daniel Mateus Pires
> Priority: Major
>
> Airflow with KubernetesExecutor starts a "watcher pod" which controls the
> lifecycle of the actual Operator instance. When a timeout occurs, it happens
> on the watcher pod, which dies without killing the Operator instance.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)