stephenonethree commented on issue #18041:
URL: https://github.com/apache/airflow/issues/18041#issuecomment-932564234


   Good news: I have created simple reproducer code, which I have tested quite 
thoroughly. **This code reproduces the bug 100% of the time.** It's a very 
simple reproducer, which will hopefully help with fixing the issue. I would be 
very grateful if somebody could look at this.
   
   GENERAL NOTES:
   1. The DAG uses a custom sensor operator which pulls XCom values and Airflow 
variables several times per poke. I am not sure, but I suspect the bug is 
related to instability caused by the network traffic these calls generate (or 
any usage of the network).
   2. The DAG runs 45 sensors in parallel indefinitely. I don't think this is a 
sensor-specific issue, because I have seen the same errors with ordinary tasks, 
though less frequently. Also, using sensors helps test for this bug because 
they repeat indefinitely until they error. (I am not sure whether the bug 
requires this many sensors, but I do have a job with 45 sensors, which is where 
I first encountered it.)
   3. Things sometimes start OK for the first few minutes, but you should 
always see the bug within 15 minutes, usually less. The longest I saw was 14 
minutes, right after the scheduler was restarted (perhaps the scheduler is more 
reliable immediately after a restart).
   4. In my tests I often saw tasks making unexpected status transitions in the 
UI, for example moving from "scheduled" to "no_status". This might be related.
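   To make the pattern concrete without opening the zip, here is a rough sketch 
of the kind of DAG involved. This is illustrative only: the class and task 
names are made up for this comment, and the real reproducer is in the 
attachment.

   ```python
   # Illustrative sketch only -- the real reproducer is in the attached zip.
   # A sensor that pulls XCom values and Airflow variables on every poke,
   # fanned out into 45 parallel tasks.
   from datetime import datetime, timedelta

   from airflow import DAG
   from airflow.models import Variable
   from airflow.sensors.base import BaseSensorOperator


   class NetworkChattySensor(BaseSensorOperator):
       """Never succeeds; just generates metadata-DB traffic on each poke."""

       def poke(self, context):
           # Several round trips to the metadata DB per poke, as in note 1.
           for _ in range(3):
               context["ti"].xcom_pull(key="some_key")
               Variable.get("env_vars", deserialize_json=True, default_var={})
           return False  # keep poking until the task errors out


   with DAG(
       dag_id="github_issue_18041_reproducer",
       start_date=datetime(2021, 1, 1),
       schedule_interval=None,
       catchup=False,
   ) as dag:
       for i in range(45):
           NetworkChattySensor(
               task_id=f"sensor_{i}",
               poke_interval=30,  # drop to 5 to see the second error type
               timeout=timedelta(days=7).total_seconds(),
           )
   ```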
   
   ERROR MESSAGES (there are two different types of error)
   
   1. When `poke_interval` is 30, you will get the "Recorded pid X does not 
match the current pid Y" error in the logs. If you turn on error emails, they 
will say "Detected as zombie", and sometimes you'll get multiple such emails 
for a single sensor.
   2. If you drop `poke_interval` to 5, you will still get the previous error, 
but sometimes the task will instead error without an error message, and the 
error email will say, "Exception: Executor reports task instance finished 
(success) although the task says its queued. (Info: None) Was the task killed 
externally?" Sometimes instead of "(success)" the email will say "(failed)".
   
   STEPS TO SETUP TEST ENVIRONMENT
   
   So far I have only tested this on Cloud Composer, with the following 
configuration:
   
   1. Create an environment with version `composer-1.17.1-airflow-2.1.2`. This 
is their latest version.
   2. No environment variable overrides.
   3. 3 worker nodes, `n1-standard-2`, 100GB disk.
   4. Webserver machine type: `composer-n1-webserver-2` (default).
   5. Cloud SQL machine type: `db-n1-standard-2` (default).
   6. For Airflow configuration, my only overrides are hopefully unrelated 
(most of the `smtp` variables, `email.email_backend`, `secrets.backend`, 
`webserver.navbar_color`, `webserver.dag_default_view`).
   7. Increase the number of schedulers to 2. (This may not be required, but I 
only tested with 2 schedulers.)
   8. Create an Airflow variable in JSON format named `env_vars`. This is just 
for the sake of the test.
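   For anyone scripting the setup, the steps above correspond roughly to the 
following `gcloud` commands. Flag names are from memory, so double-check them 
against the current docs, and the variable value below is just a placeholder.

   ```shell
   # Sketch of the Composer setup above -- verify flags before use.
   gcloud composer environments create repro-env \
       --location us-central1 \
       --image-version composer-1.17.1-airflow-2.1.2 \
       --node-count 3 \
       --machine-type n1-standard-2 \
       --disk-size 100GB

   # Create the JSON Airflow variable used by the test (placeholder value);
   # I raised the scheduler count to 2 separately via the Console.
   gcloud composer environments run repro-env \
       --location us-central1 \
       variables set -- env_vars '{"example": "value"}'
   ```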
   
   I made other changes to things that I don't think are related; for example, 
I use a custom GCP service account. I can share further details if you can't 
reproduce it yourself.
   
   My code is attached to this comment.
   
   That's it! I really hope this helps get to the bottom of this!
   
   
[github_issue_18041.zip](https://github.com/apache/airflow/files/7270015/github_issue_18041.zip)
   

