ahipp13 commented on issue #40957:
URL: https://github.com/apache/airflow/issues/40957#issuecomment-2269883515
Coming back to comment on my current situation. After looking at tons of logs and
adjusting many settings (turning off remote logging, dropping to 1 scheduler,
tuning the pgbouncer metadataPoolSize and resultBackendPoolSize, trying all of the
scheduler's tunable settings, and changing parallelism and the number of parsing
processes), my issue still persists...
I turned on DEBUG logging and went through the logs that flow into our
Splunk instance, and this is what I found:
```
2024-08-05T15:02:29.338627294-05:00 stdout F [2024-08-05T20:02:29.338+0000]
{taskinstance.py:1206} INFO - Marking task as SUCCESS. dag_id=DAGID,
task_id=send_email, run_id=scheduled__2024-08-05T19:00:00+00:00,
execution_date=20240805T190000, start_date=20240805T200229,
end_date=20240805T200229
2024-08-05T15:02:29.338882785-05:00 stdout F [2024-08-05T20:02:29.338+0000]
{taskinstance.py:562} DEBUG - Task Duration set to 0.308998
2024-08-05T15:02:29.349812594-05:00 stdout F [2024-08-05T20:02:29.349+0000]
{cli_action_loggers.py:88} DEBUG - Calling callbacks: []
2024-08-05T15:02:29.400439168-05:00 stdout F [2024-08-05T20:02:29.399+0000]
{local_task_job_runner.py:240} INFO - Task exited with return code 0
2024-08-05T15:02:29.421373484-05:00 stdout F [2024-08-05T20:02:29.420+0000]
{dagrun.py:931} DEBUG - number of tis tasks for <DagRun DAGID @ 2024-08-05
19:00:00+00:00: scheduled__2024-08-05T19:00:00+00:00, state:running, queued_at:
2024-08-05 20:00:00.329940+00:00. externally triggered: False>: 0 task(s)
2024-08-05T15:02:29.421554705-05:00 stdout F [2024-08-05T20:02:29.421+0000]
{taskinstance.py:3503} INFO - 0 downstream tasks scheduled from follow-on
schedule check
2024-08-05T15:02:29.423594544-05:00 stdout F [2024-08-05T20:02:29.423+0000]
{local_task_job_runner.py:222} INFO - ::endgroup::
```
As you can see from the logs, the task is marked as SUCCESS and exits with
return code 0. It then ships its logs to remote logging and disposes of the
DB connection pool:
```
2024-08-05T15:02:29.964711666-05:00 stdout F [2024-08-05T20:02:29.964+0000]
{connectionpool.py:546} DEBUG - BLOBURL "PUT
/airflow-logs/wasb-airflow-logs/dag_id%3DDAGID/run_id%3Dscheduled__2024-08-05T19%3A00%3A00%2B00%3A00/task_id%3Dsend_email/attempt%3D1.log
HTTP/1.1" 201 0
2024-08-05T15:02:29.965739791-05:00 stdout F [2024-08-05T20:02:29.965+0000]
{cli_action_loggers.py:88} DEBUG - Calling callbacks: []
2024-08-05T15:02:29.969201556-05:00 stdout F [2024-08-05T20:02:29.968+0000]
{settings.py:363} DEBUG - Disposing DB connection pool (PID 7)
```
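Since the worker process clearly exits cleanly, one thing worth checking is what the
Kubernetes API actually reports for the pod at this point, i.e. whether the main
container has terminated while the pod object itself stays up. A rough sketch with the
official kubernetes Python client (the pod name and namespace below are placeholders,
not the real values):
```
# Rough diagnostic sketch using the official kubernetes Python client.
# POD_NAME and NAMESPACE are placeholders for the real (redacted) values.
from kubernetes import client, config

POD_NAME = "DAGNAME-d859kywo"  # placeholder
NAMESPACE = "NAMESPACE"        # placeholder

config.load_kube_config()      # use config.load_incluster_config() from inside the cluster
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name=POD_NAME, namespace=NAMESPACE)
print("pod phase:", pod.status.phase)
for cs in pod.status.container_statuses or []:
    if cs.state.terminated:
        print(f"{cs.name}: terminated with exit code {cs.state.terminated.exit_code}")
    elif cs.state.running:
        print(f"{cs.name}: still running since {cs.state.running.started_at}")
    elif cs.state.waiting:
        print(f"{cs.name}: waiting ({cs.state.waiting.reason})")
```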
Even with the worker exiting cleanly, the pod should then complete and terminate,
but it doesn't... Looking at the scheduler logs right after, this is what happens:
```
2024-08-05T15:03:03.773641728-05:00 stdout F [2024-08-05T20:03:03.773+0000]
{kubernetes_executor_utils.py:163} DEBUG - Event: DAGNAME-d859kywo had an event
of type ADDED
2024-08-05T15:03:03.773756004-05:00 stdout F [2024-08-05T20:03:03.773+0000]
{kubernetes_executor_utils.py:299} INFO - Event: DAGNAME-d859kywo is Running,
annotations: <omitted>
2024-08-05T15:03:35.038948792-05:00 stdout F [2024-08-05T20:03:35.038+0000]
{kubernetes_executor_utils.py:163} DEBUG - Event: DAGNAME-d859kywo had an event
of type ADDED
2024-08-05T15:03:35.039130315-05:00 stdout F [2024-08-05T20:03:35.038+0000]
{kubernetes_executor_utils.py:299} INFO - Event: DAGNAME-d859kywo is Running,
annotations: <omitted>
```
This repeats over and over, and the pod is never torn down...
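To see what the scheduler's watcher is actually being handed by the API server, a
minimal watch sketch like the one below can be run against the same namespace; it
mimics the kind of event stream that produces the ADDED/Running lines above (the
namespace and label selector are placeholders):
```
# Rough sketch: stream pod events the same general way the executor watcher does,
# printing the event type and the phase reported by the API server.
# NAMESPACE and LABEL_SELECTOR are placeholders for the real (redacted) values.
from kubernetes import client, config, watch

NAMESPACE = "NAMESPACE"            # placeholder
LABEL_SELECTOR = "dag_id=DAGNAME"  # placeholder; adjust to the labels on the worker pods

config.load_kube_config()
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_namespaced_pod, namespace=NAMESPACE,
                      label_selector=LABEL_SELECTOR, timeout_seconds=120):
    pod = event["object"]
    print(event["type"], pod.metadata.name, pod.status.phase)
```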
For comparison, on a pod that does terminate successfully, the scheduler logs look like this:
```
2024-08-05T07:54:45.288518222-05:00 stdout F [2024-08-05T12:54:45.287+0000]
{settings.py:363} DEBUG - Disposing DB connection pool (PID 7)
2024-08-05T07:54:47.023200045-05:00 stdout F [2024-08-05T12:54:47.022+0000]
{kubernetes_executor_utils.py:163} DEBUG - Event: DAGNAME-0bsziyyj had an event
of type MODIFIED
2024-08-05T07:54:47.023349257-05:00 stdout F [2024-08-05T12:54:47.023+0000]
{kubernetes_executor_utils.py:299} INFO - Event: DAGNAME-0bsziyyj is Running,
annotations: <omitted>
2024-08-05T07:54:48.321534896-05:00 stdout F [2024-08-05T12:54:48.321+0000]
{kubernetes_executor_utils.py:163} DEBUG - Event: DAGNAME-0bsziyyj had an event
of type MODIFIED
2024-08-05T07:54:48.32176453-05:00 stdout F [2024-08-05T12:54:48.321+0000]
{kubernetes_executor_utils.py:284} INFO - Event: DAGNAME-0bsziyyj Succeeded,
annotations: <omitted>
2024-08-05T07:54:48.435163144-05:00 stdout F [2024-08-05T12:54:48.434+0000]
{kubernetes_executor_utils.py:478} DEBUG - Processing task ('DAGNAME-0bsziyyj',
'eis-dw-dev-k8s', None, {'dag_id': 'DAGNAME', 'task_id': 'send_email',
'execution_date': None, 'run_id': 'scheduled__2024-08-05T12:40:00+00:00',
'try_number': '1'}, '930001253')
2024-08-05T07:54:48.435228658-05:00 stdout F [2024-08-05T12:54:48.435+0000]
{kubernetes_executor_utils.py:486} DEBUG - Attempting to finish pod; pod_name:
DAGNAME-0bsziyyj; state: None; annotations: <omitted>
2024-08-05T07:54:48.435486516-05:00 stdout F [2024-08-05T12:54:48.435+0000]
{kubernetes_executor_utils.py:494} DEBUG - finishing job
TaskInstanceKey(dag_id='DAGNAME', task_id='send_email',
run_id='scheduled__2024-08-05T12:40:00+00:00', try_number=1, map_index=-1) -
None (DAGNAME-0bsziyyj)
2024-08-05T07:54:48.436378934-05:00 stdout F [2024-08-05T12:54:48.436+0000]
{kubernetes_executor.py:370} INFO - Changing state of
(TaskInstanceKey(dag_id='DAGNAME', task_id='send_email',
run_id='scheduled__2024-08-05T12:40:00+00:00', try_number=1, map_index=-1),
None, 'DAGNAME-0bsziyyj', 'NAMESPACE', '930001253') to None
2024-08-05T07:54:48.436517827-05:00 stdout F [2024-08-05T12:54:48.436+0000]
{kubernetes_executor_utils.py:442} DEBUG - Deleting pod DAGNAME-0bsziyyj in
namespace NAMESPACE
```
So it seems like, for some reason, the "kubernetes_executor_utils" watcher is
looking at the completed task's pod and never seeing a Succeeded state, even though
the task logs that it was marked as SUCCESS, and it also shows up as succeeded in
the Airflow UI...
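For reference, the mismatch can be seen directly by comparing the two sources of
truth side by side: the TaskInstance state in the Airflow metadata DB versus the pod
phase reported by the Kubernetes API (all identifiers below are placeholders):
```
# Rough sketch: compare Airflow's recorded TaskInstance state with the pod phase
# reported by the Kubernetes API. All identifiers are placeholders for redacted values.
from airflow.models import TaskInstance
from airflow.utils.session import create_session
from kubernetes import client, config

DAG_ID, TASK_ID = "DAGNAME", "send_email"
RUN_ID = "scheduled__2024-08-05T19:00:00+00:00"
POD_NAME, NAMESPACE = "DAGNAME-d859kywo", "NAMESPACE"

with create_session() as session:
    ti = (session.query(TaskInstance)
          .filter_by(dag_id=DAG_ID, task_id=TASK_ID, run_id=RUN_ID)
          .one())
    print("metadata DB state:", ti.state)   # shows success, matching the UI

config.load_kube_config()
pod = client.CoreV1Api().read_namespaced_pod(name=POD_NAME, namespace=NAMESPACE)
print("pod phase from API:", pod.status.phase)  # stays Running for the stuck pods
```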
Will try and keep digging but am really unsure where to look at this point.