stephenonethree commented on issue #18041: URL: https://github.com/apache/airflow/issues/18041#issuecomment-932564234
Good news: I have created a simple reproducer, which I have tested quite thoroughly. **This code reproduces the bug 100% of the time.** It is a very simple reproducer, which hopefully will help with fixing it. I would be very grateful if somebody could look at this.

GENERAL NOTES:
1. The DAG uses a custom sensor operator that pulls XCom values and Airflow Variables several times per poke. I am not sure, but I suspect the bug is related to instability caused by the network traffic these calls generate (or by any use of the network).
2. The DAG runs 45 sensors in parallel indefinitely. I don't think this is a sensor-specific issue, because I have seen the same errors with ordinary tasks, though less frequently. Sensors are simply a convenient way to test for the bug, since they repeat indefinitely until they error. (I am not sure whether the bug requires this many sensors, but I do have a job with 45 sensors, which is where I first encountered it.)
3. Things sometimes start out fine for the first few minutes, but you should always see the bug within 15 minutes, usually less. I once saw it take 14 minutes, right after the scheduler was restarted (perhaps the scheduler is more reliable just after a restart).
4. In my tests I often saw tasks making unexpected status transitions in the UI, for example moving from "scheduled" to "no_status". This might be related.

ERROR MESSAGES (there are two different types of error):
1. With `poke_interval=30`, you get the "Recorded pid X does not match the current pid Y" error in the logs. If you turn on error emails, they say "Detected as zombie", and sometimes you get multiple such emails for a single sensor.
2. If you drop `poke_interval` to 5, you still get the previous error, but sometimes the task instead fails with no error message, and the error email says: "Exception: Executor reports task instance finished (success) although the task says its queued. (Info: None) Was the task killed externally?"
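For reference, the reproducer DAG looks roughly like the sketch below (this is a paraphrase, not the exact code in the attached zip; the task IDs, Variable usage, and sensor class name are illustrative). It requires an Airflow 2.1 deployment to run. The `poke()` method deliberately performs several metadata-database round trips per poke and never returns `True`, so each of the 45 sensors repokes until something errors:

```python
# Sketch of the reproducer, assuming Airflow 2.1 APIs. All names here
# (ChattySensor, task IDs, the "env_vars" Variable) are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.models import Variable
from airflow.sensors.base import BaseSensorOperator


class ChattySensor(BaseSensorOperator):
    """A sensor whose poke() pulls Variables and XCom several times,
    mimicking the network traffic pattern described above."""

    def poke(self, context):
        # Several metadata-DB round trips per poke.
        Variable.get("env_vars", deserialize_json=True)
        context["ti"].xcom_pull(task_ids="upstream_task")
        Variable.get("env_vars", deserialize_json=True)
        context["ti"].xcom_pull(task_ids="upstream_task")
        return False  # never succeed: repoke until the task errors


with DAG(
    dag_id="github_issue_18041_reproducer",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    for i in range(45):
        ChattySensor(
            task_id=f"chatty_sensor_{i}",
            poke_interval=30,  # drop to 5 to provoke the second error type
            mode="poke",
        )
```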
Sometimes the email says "(failed)" instead of "(success)".

STEPS TO SET UP THE TEST ENVIRONMENT:
So far I have only tested this on Cloud Composer, with the following configuration:
1. Create an environment with version `composer-1.17.1-airflow-2.1.2` (their latest version).
2. No environment variable overrides.
3. 3 worker nodes, `n1-standard-2`, 100 GB disk.
4. Webserver machine type: `composer-n1-webserver-2` (default).
5. Cloud SQL machine type: `db-n1-standard-2` (default).
6. For the Airflow configuration, my only overrides are hopefully unrelated (most of the `smtp` variables, `email.email_backend`, `secrets.backend`, `webserver.navbar_color`, `webserver.dag_default_view`).
7. Increase the number of schedulers to 2. (This may not be required, but I only tested with 2 schedulers.)
8. Create an Airflow Variable in JSON format named `env_vars`. This is just for the sake of the test.

I made other changes to things that I don't think are related; for example, I use a custom GCP service account. I can share further details if you can't reproduce this yourself. My code is attached to this comment.

That's it! I really hope this helps get to the bottom of this!

[github_issue_18041.zip](https://github.com/apache/airflow/files/7270015/github_issue_18041.zip)
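The Composer setup above can be sketched with `gcloud` roughly as follows. This is an approximation, not the exact commands I ran: the environment name and location are placeholders, the Variable value is illustrative, and flag names may differ slightly depending on your `gcloud` version (in particular, `--scheduler-count` may require the beta track on some versions):

```shell
# Create a Composer 1 environment matching the configuration above.
# Environment name and location are placeholders.
gcloud composer environments create repro-env \
    --location us-central1 \
    --image-version composer-1.17.1-airflow-2.1.2 \
    --node-count 3 \
    --machine-type n1-standard-2 \
    --disk-size 100GB \
    --web-server-machine-type composer-n1-webserver-2 \
    --cloud-sql-machine-type db-n1-standard-2

# Bump the scheduler count to 2 (flag availability varies by gcloud version).
gcloud composer environments update repro-env \
    --location us-central1 \
    --scheduler-count 2

# Create the JSON Airflow Variable named "env_vars" (value is illustrative).
gcloud composer environments run repro-env \
    --location us-central1 \
    variables set -- env_vars '{"example_key": "example_value"}'
```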
