kosteev opened a new issue #17381:
URL: https://github.com/apache/airflow/issues/17381


   **Apache Airflow version**: 2.1.1
   
   **What happened**:
   
   `_check_for_stalled_adopted_tasks` method breaks on first item which 
satisfies condition
   
https://github.com/apache/airflow/blob/2.1.1/airflow/executors/celery_executor.py#L353
   
   From the comment/logic, it looks like the idea is to optimize this piece of 
code, however it is not seen that `self.adopted_task_timeouts` object maintains 
entries sorted by timestamp. This results in unstable behaviour of scheduler, 
which means it sometimes may not resend tasks to Celery (due to skipping them).
   
   Confirmed with Airflow 2.1.1
   
   **What you expected to happen**:
   
   Deterministic behaviour of scheduler in this case
   
   **How to reproduce it**:
   
   These are steps to reproduce adoption of tasks. To reproduce unstable 
behaviour, you may need to do trigger some additional DAGs in the process.
   
   - set [core]parallelism to 30
   - trigger DAG with concurrency==100 and 30 tasks, each is running for 30 
minutes (e.g. sleep 1800)
   - 18 of them will be running, others will be in queued state
   - restart scheduler
   - observe "Adopted tasks were still pending after 0:10:00, assuming they 
never made it to celery and clearing:"
   - tasks will be failed and marked as "up for retry"
   
   The important thing is that scheduler has to restart once tasks get to the 
queue, so that it will adopt queued tasks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to