stephenonethree commented on issue #18041:
URL: https://github.com/apache/airflow/issues/18041#issuecomment-932564234


   Good news: I have created simple reproducer code, which I have tested quite 
thoroughly. **This code reproduces the bug 100% of the time.** It's a very 
simple reproducer, which will hopefully help with fixing the issue. I would be 
very grateful if somebody could look at this.
   
   GENERAL NOTES:
   1. The DAG uses a custom sensor operator which pulls XCom values and Airflow 
variables several times per poke. I am not sure, but I suspect the bug is 
related to instability caused by the network traffic these calls generate (or 
any usage of the network).
   2. The DAG runs 45 sensors in parallel indefinitely. I don't think this is a 
sensor-specific issue, because I have seen the same errors with ordinary tasks, 
though less frequently. Also, using sensors helps test for this bug because 
they repeat indefinitely until they error. (I am not sure whether the bug 
requires this many sensors, but I do have a job with 45 sensors, which is where 
I first encountered it.)
   3. Things sometimes start OK for the first few minutes, but you should 
always see the bug within 15 minutes, usually less. The longest I saw was 14 
minutes, right after the scheduler was restarted (perhaps the scheduler is more 
reliable immediately after a restart).
   4. In my tests I often saw tasks making unexpected status transitions in the 
UI, for example moving from "scheduled" to "no_status". This might be related.
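   To make the pattern concrete without opening the zip, here is a rough sketch 
of the kind of DAG involved. This is illustrative only: the class and task 
names are made up for this comment, and the real reproducer is in the 
attachment.

   ```python
   # Illustrative sketch only -- the real reproducer is in the attached zip.
   # A sensor that pulls XCom values and Airflow variables on every poke,
   # fanned out into 45 parallel tasks.
   from datetime import datetime, timedelta

   from airflow import DAG
   from airflow.models import Variable
   from airflow.sensors.base import BaseSensorOperator


   class NetworkChattySensor(BaseSensorOperator):
       """Never succeeds; just generates metadata-DB traffic on each poke."""

       def poke(self, context):
           # Several round trips to the metadata DB per poke, as in note 1.
           for _ in range(3):
               context["ti"].xcom_pull(key="some_key")
               Variable.get("env_vars", deserialize_json=True, default_var={})
           return False  # keep poking until the task errors out


   with DAG(
       dag_id="github_issue_18041_reproducer",
       start_date=datetime(2021, 1, 1),
       schedule_interval=None,
       catchup=False,
   ) as dag:
       for i in range(45):
           NetworkChattySensor(
               task_id=f"sensor_{i}",
               poke_interval=30,  # drop to 5 to see the second error type
               timeout=timedelta(days=7).total_seconds(),
           )
   ```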
   
   ERROR MESSAGES (there are two different types of error)
   
   1. When `poke_interval` is 30, you will get the "Recorded pid X does not 
match the current pid Y" error in the logs. If you turn on error emails, they 
will say "Detected as zombie", and sometimes you'll get multiple such emails 
for a single sensor.
   2. If you drop `poke_interval` to 5, you will still get the previous error, 
but sometimes the task will instead error without an error message, and the 
error email will say, "Exception: Executor reports task instance finished 
(success) although the task says its queued. (Info: None) Was the task killed 
externally?" Sometimes instead of "(success)" the email will say "(failed)".
   
   STEPS TO SETUP TEST ENVIRONMENT
   
   So far I have only tested this on Cloud Composer, with the following 
configuration:
   
   1. Create an environment with version `composer-1.17.1-airflow-2.1.2`. This 
is their latest version.
   2. No environment variable overrides.
   3. 3 worker nodes, `n1-standard-2`, 100GB disk.
   4. Webserver machine type: `composer-n1-webserver-2` (default).
   5. Cloud SQL machine type: `db-n1-standard-2` (default).
   6. For Airflow configuration, my only overrides are hopefully unrelated 
(most of the `smtp` variables, `email.email_backend`, `secrets.backend`, 
`webserver.navbar_color`, `webserver.dag_default_view`).
   7. Increase the number of schedulers to 2. (This may not be required, but I 
only tested with 2 schedulers.)
   8. Create an Airflow variable in JSON format named `env_vars`. This is just 
for the sake of the test.
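   For anyone scripting the setup, the steps above correspond roughly to the 
following `gcloud` commands. Flag names are from memory, so double-check them 
against the current docs, and the variable value below is just a placeholder.

   ```shell
   # Sketch of the Composer setup above -- verify flags before use.
   gcloud composer environments create repro-env \
       --location us-central1 \
       --image-version composer-1.17.1-airflow-2.1.2 \
       --node-count 3 \
       --machine-type n1-standard-2 \
       --disk-size 100GB

   # Create the JSON Airflow variable used by the test (placeholder value);
   # I raised the scheduler count to 2 separately via the Console.
   gcloud composer environments run repro-env \
       --location us-central1 \
       variables set -- env_vars '{"example": "value"}'
   ```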
   
   I made other changes to things that I don't think are related; for example, 
I use a custom GCP service account. I can share further details if you can't 
reproduce it yourself.
   
   My code is attached to this comment.
   
   That's it! I really hope this helps get to the bottom of this!
   
   
[github_issue_18041.zip](https://github.com/apache/airflow/files/7270015/github_issue_18041.zip)
   

