trlopes1974 commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2226843279
> Looks like for some reason something is killing your tasks, and this will happen for example when you do not have enough memory (or other resources that your tasks need).
I don't want to be "rude" (not quite the right word), but as I said, it seems that we are dealing with different issues.
Unlike @andrew-stein-sp, I don't have a single task timeout, or at least I cannot find one in the logs. I also don't believe this is a DNS issue, as we use a local DNS server and all our "foreign" hosts are in /etc/hosts, so very few requests actually go to DNS.
I also don't believe it is a lack of resources: our box is not paging, and we have a small number of DAGs with few tasks each, although several of them do run on a 5-minute interval.
Most of our errors are related to 2 specific DAGs (nothing special or particular about them), but one of them does use several SSHOperators, and that is where most of the failed tasks (Celery tasks, not Airflow tasks) are marked as failed.
The failures can also occur on some EmptyOperator tasks (and these are Airflow tasks) with that amazing error: """Exception: Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-07-07T12:07:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?"""
There is no resource contention that we have noticed (although, I just remembered, we did not check the database, which is PostgreSQL).
I can accept that something at the OS level might be killing some of these tasks, but I cannot find any trace of it.
I did find some old posts regarding "donot_pickle" from someone with similar issues, but we already have that setting set to True.
On the configuration side, we have this as the default:
# Time in seconds after which tasks queued in celery are assumed to be stalled, and are automatically rescheduled.
# Adopted tasks will instead use the ``task_adoption_timeout`` setting if specified.
# When set to 0, automatic clearing of stalled tasks is disabled.
stalled_task_timeout = 0
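As a side note, one way to double-check which values the scheduler actually resolves is to read them through Airflow's own configuration object. This is just a quick sketch; the section names I use ([scheduler] for task_queued_timeout and [celery] for stalled_task_timeout) are my assumption and may differ between Airflow releases:

```python
# Quick check of the effective timeout settings as Airflow resolves them.
# Run inside the Airflow environment (e.g. `python` on the scheduler host).
from airflow.configuration import conf

# Section names are an assumption for our Airflow version; adjust if needed.
print("task_queued_timeout:", conf.get("scheduler", "task_queued_timeout", fallback="600"))
print("stalled_task_timeout:", conf.get("celery", "stalled_task_timeout", fallback="0"))
```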
With this setting it makes sense (to me) that we have tasks failing after 10 minutes (as stated in the previous logs), but it does not explain why the task was queued and never executed.
So, if a task is QUEUED but does not execute within the defined timeouts, then when task_queued_timeout=600 is reached the task is not rescheduled (as stalled_task_timeout is 0) and it is marked as failed (this is my logic, I have not looked into any code; see the sketch below). If this is the case, the logs should be improved so that there is no doubt about why the task is failing. Yet, this still does not explain WHY the task was not executed or, if it was, why it failed / stalled. (A bit confusing.)
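To make my reasoning explicit, here is a rough sketch of what I imagine happens to a queued task under these settings. This is not the actual Airflow code, just my assumption written out; the function and variable names are made up:

```python
import time

# Hypothetical values mirroring our configuration (not real Airflow code).
TASK_QUEUED_TIMEOUT = 600   # [scheduler] task_queued_timeout
STALLED_TASK_TIMEOUT = 0    # [celery] stalled_task_timeout (0 = disabled)


def handle_queued_task(queued_at: float, now: float | None = None) -> str:
    """My mental model: a task stuck in QUEUED is either rescheduled
    (if stalled_task_timeout > 0) or simply failed once
    task_queued_timeout is exceeded."""
    now = now if now is not None else time.time()
    queued_for = now - queued_at

    if STALLED_TASK_TIMEOUT and queued_for > STALLED_TASK_TIMEOUT:
        return "rescheduled"   # stalled-task clearing would kick in here
    if queued_for > TASK_QUEUED_TIMEOUT:
        return "failed"        # what we observe: failed without ever running
    return "still queued"


# Example: a task queued ~11 minutes ago with our settings ends up failed.
print(handle_queued_task(queued_at=time.time() - 660))  # -> "failed"
```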
I would love to help with this, but I cannot even find the cause in the logs.
I also have a question. In celery_executor_utils.py:
def _execute_in_fork(command_to_exec: CommandType, celery_task_id: str | None = None) -> None:
    pid = os.fork()
    if pid:
        # In parent, wait for the child
        log.info(f"fork pid: {pid} - celery_task_id:{celery_task_id}")
        pid, ret = os.waitpid(pid, 0)
        log.info(f"wait pid PID/RET: {pid}/{ret}")
        if ret == 0:
            return

        msg = f"Celery command failed on host: {get_hostname()} with celery_task_id {celery_task_id} (PID: {pid}, Return Code: {ret})"
        raise AirflowException(msg)
In our case the error does not come from the task execution itself but from the waitpid() call, which returns 256 for us, so it did not even launch the task.
I also found that using fork or EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=true gives the same result...
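One detail worth noting about that "Return Code" value: os.waitpid() returns a raw wait status, not a plain exit code, so the 256 we see actually decodes to the child process exiting with code 1. A small self-contained illustration (the fork and exit here are only for demonstration):

```python
import os

# Minimal illustration of how os.waitpid() encodes the child's exit status.
# The value 256 from our logs decodes to a plain exit code of 1.
pid = os.fork()
if pid == 0:
    # Child: terminate with exit code 1, like a failing Celery command would.
    os._exit(1)

_, status = os.waitpid(pid, 0)
print("raw wait status:", status)                    # -> 256
print("exited normally:", os.WIFEXITED(status))      # -> True
print("decoded exit code:", os.WEXITSTATUS(status))  # -> 1
```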