Hi all,

We recently doubled the number of tasks and DAGs in our core data pipeline in Airflow (~3500 tasks now). We've done a few things to handle the increased load:

- increased resources server-side to handle the increased CPU usage,
- added priority_weight to tasks, and
- switched from LocalExecutor to the CeleryExecutor, with Flower for monitoring.
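For reference, the priority_weight change looks roughly like this (a minimal sketch with made-up DAG and task names, not our real pipeline; it assumes the Airflow 1.10-era imports, and the executor switch itself lives in airflow.cfg as executor = CeleryExecutor):

    from datetime import datetime, timedelta

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    default_args = {"owner": "data", "retries": 1, "retry_delay": timedelta(minutes=5)}

    with DAG(
        dag_id="core_pipeline",  # hypothetical name
        default_args=default_args,
        start_date=datetime(2019, 4, 1),
        schedule_interval="@daily",
    ) as dag:
        extract = BashOperator(
            task_id="extract",
            bash_command="echo extract",
            # higher priority_weight means the executor pulls this task
            # off the queue ahead of lower-weight tasks
            priority_weight=10,
        )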
We've started seeing some interesting errors associated with Celery no longer running tasks. The Airflow UI shows tasks sitting in the queue in a "scheduled" or None state, while Flower shows Celery processing nothing. It starts around 45-60 minutes into the run. The errors are along the lines of:

    ERROR - Executor reports task instance <TaskInstance: yadayada 2019-04-24 18:00:00+00:00 [queued]> finished (success) although the task says its queued. Was the task killed externally?
    NoneType: None

For those same tasks, we're seeing try_number set to a negative number.

- Any thoughts on what could lead Airflow to assign a negative number to try_number?
- Any thoughts on what causes the executor and the task to get out of sync on the task's state?
- Are these two issues related?

Any ideas would be welcome. The tasks themselves show no logs in the Airflow UI when I go to that view... so it's sad.

Thanks!

Teresa Martyny
pronouns: she, her, hers
Software Engineer | Data Team Lead | Omada Health <https://www.omadahealth.com/>
500 Sansome St #200, SF, CA 94111
What is Omada? <https://vimeo.com/203386025>
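P.S. For anyone trying to reproduce this, here's roughly how one might confirm the negative try_number values directly against the metadata DB (a sketch only; it assumes direct access to the Airflow metadata database and the standard task_instance table):

    from airflow import settings
    from sqlalchemy import text

    # Look for task instances whose try_number has gone negative;
    # the columns come from the standard task_instance table.
    session = settings.Session()
    rows = session.execute(text(
        "SELECT dag_id, task_id, execution_date, try_number, state "
        "FROM task_instance WHERE try_number < 0"
    )).fetchall()
    for row in rows:
        print(row)
    session.close()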
