trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2226843279

   > y looks like for some reason something is killing your tasks. and this 
will happen for example when you have not enough memory (or other resources 
that your tasks need
   
   I don't mean to be blunt, but as I said, it seems we are dealing with different issues.
   Unlike @andrew-stein-sp, I don't have a single task timeout, or at least I cannot find one in the logs. I also don't believe this is a DNS issue: we use a local DNS server and all our "foreign" hosts are in /etc/hosts, so very few requests ever reach DNS.
   Nor do I believe it is a lack of resources: the box is not paging, and we have a low number of DAGs, each with few tasks, although several of them do run on a 5-minute interval.
   Most of our errors are related to 2 specific DAGs (nothing special or particular about them), but one of them uses several SSHOperators, and those are the tasks most often marked as failed (Celery tasks, not Airflow tasks).
   The failures can also occur on some EmptyOperator tasks (and these are Airflow tasks) with that amazing error """Exception: Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-07-07T12:07:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?"""
   
   We have not noticed any resource contention (though I just remembered that we did not check the database, which is PostgreSQL).
   
   I can accept that something at the OS level might be killing some of these tasks, but I cannot find any trace of it.
   I did find some old posts about "donot_pickle" from someone with similar issues, but we do have that setting set to True.
   
   On the configuration side, we have the default:
   # Time in seconds after which tasks queued in celery are assumed to be stalled, and are automatically
   # rescheduled. Adopted tasks will instead use the ``task_adoption_timeout`` setting if specified.
   # When set to 0, automatic clearing of stalled tasks is disabled.
   stalled_task_timeout = 0
   
   With this setting it makes sense (to me) that tasks fail after 10 minutes (as shown in the earlier logs), but it does not explain why the task was queued and never executed.
   So, if a task is QUEUED but does not start within the defined timeouts, then when task_queued_timeout=600 is reached the task is not rescheduled (since stalled_task_timeout is 0) and it is marked as failed (this is my reasoning, I have not looked at the code). If that is the case, the log message should be improved so there is no doubt about why the task is failing. Still, this does not explain WHY the task was not executed or, if it was, why it failed or stalled.
   (a bit confusing)
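   
   To double-check that reading, here is a minimal sketch of how I would print the values Airflow actually resolves at runtime (assuming Airflow 2.6+, where task_queued_timeout lives in [scheduler] and the older, deprecated stalled_task_timeout in [celery]; adjust the section names if your version differs):
   
   from airflow.configuration import conf
   
   # Effective timeout values as the scheduler resolves them.
   # Section/option names are my assumption for Airflow 2.6+.
   print("task_queued_timeout:", conf.getfloat("scheduler", "task_queued_timeout", fallback=600.0))
   print("stalled_task_timeout:", conf.getfloat("celery", "stalled_task_timeout", fallback=0.0))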
   
   I'd love to help with this, but I cannot even find a trace of the problem in the logs.
   
   
   I also have a question. In celery_executor_utils.py:
   
   def _execute_in_fork(command_to_exec: CommandType, celery_task_id: str | None = None) -> None:
       pid = os.fork()
       if pid:
           # In parent, wait for the child
           log.info(f"fork pid: {pid}  - celery_task_id:{celery_task_id}")
           pid, ret = os.waitpid(pid, 0)
           log.info(f"wait pid PID/RET: {pid}/{ret}")
           if ret == 0:
               return

           msg = f"Celery command failed on host: {get_hostname()} with celery_task_id {celery_task_id} (PID: {pid}, Return Code: {ret})"
           raise AirflowException(msg)
   
   
   In our case, the error does not come from executing the task itself but from the waitpid() call, which returns 256 in our case, so it seems the task was never even launched.
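   
   One detail worth noting: os.waitpid() returns the raw wait status, not an exit code, and a status of 256 decodes to a normal exit with code 1 (not a signal). A quick sketch of how to decode it (the helper below is just for illustration, it is not part of Airflow):
   
   import os
   
   def describe_wait_status(status: int) -> str:
       """Decode the raw status returned by os.waitpid() (illustrative only)."""
       if os.WIFEXITED(status):
           # Normal exit; the exit code is stored in the high byte of the status.
           return f"exited with code {os.WEXITSTATUS(status)}"
       if os.WIFSIGNALED(status):
           # Terminated by a signal (e.g. 9 if the OOM killer sent SIGKILL).
           return f"killed by signal {os.WTERMSIG(status)}"
       return f"unrecognized status {status}"
   
   print(describe_wait_status(256))  # -> exited with code 1
   
   So if the status really is 256, the forked child exited with code 1 rather than being killed by a signal, which would at least rule out an external kill of that child process.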
   
   I also found that using fork or setting EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=true gives the same result...
   
   

