Kytha opened a new pull request, #41870:
URL: https://github.com/apache/airflow/pull/41870
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
-->
<!--
Thank you for contributing! Please make sure that your code changes
are covered with tests. And in case of new features or big changes
remember to adjust the documentation.
Feel free to ping committers for the review!
In case of an existing issue, reference it using one of the following:
closes: #ISSUE
related: #ISSUE
How to write a good git commit message:
http://chris.beams.io/posts/git-commit/
-->
An Airflow scheduler performance test highlighted a very hot piece of code
in the celery executor when using a database results backend. This code appeared
to be doing redundant work. Below is a flamegraph of the Airflow scheduler
process captured by a statistical profiler
([py-spy](https://github.com/benfred/py-spy)) during a period of heavy load
(4000 tasks requiring scheduling).

During this one-minute profiler session, the scheduler spent 42% of its time
nested within [this
line](https://github.com/apache/airflow/blob/main/airflow/providers/celery/executors/celery_executor.py#L304)
of code. This code is so hot because, when using celery with a
database results backend, celery [will not pool database
connections](https://github.com/celery/celery/blob/main/celery/backends/database/session.py#L43-L53)
(unless the process is forked), so a new database connection must be established
for each task in the loop. This is very expensive and scales linearly with the
number of tasks. The flame graph shows that most of this time is spent creating
database connections.
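
To make the pattern concrete, here is a minimal, hedged sketch (illustrative only, not the actual executor code) of a per-task state check against the celery results backend; with a non-pooled database backend, every `.state` access opens a fresh connection:

```python
# Illustrative sketch only -- not the actual Airflow executor code.
# With celery's database results backend (no connection pooling unless the
# process is forked), each `.state` access issues a backend query over a
# newly created database connection.
from celery.result import AsyncResult


def check_task_states(app, task_ids):
    """Check task states one at a time (the hot per-task pattern)."""
    states = {}
    for task_id in task_ids:              # cost scales with len(task_ids)
        result = AsyncResult(task_id, app=app)
        states[task_id] = result.state    # new DB connection per lookup
    return states
```
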
The solution put forth in this PR is to remove this operation entirely from
the `_process_tasks` method. The justification is that
immediately after the celery executor processes tasks, the [sync method of the
celery executor will be called by its parent base
executor](https://github.com/apache/airflow/blob/main/airflow/executors/base_executor.py#L241-L245)
to sync task state, which in my view renders this line of code redundant.
When syncing, the celery executor [makes use of batch
fetching](https://github.com/apache/airflow/blob/main/airflow/providers/celery/executors/celery_executor.py#L340)
and is therefore much better optimized. A sketch of this call ordering is shown below.
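
For reference, a hedged sketch of the call ordering this change relies on (method names follow the linked base executor code, but the bodies here are placeholders rather than the real implementations):

```python
# Simplified sketch of the base executor heartbeat ordering; not the real
# implementation, just the sequence this PR's justification depends on.
class ExecutorHeartbeatSketch:
    def heartbeat(self):
        # trigger_tasks() ends up in CeleryExecutor._process_tasks; with this
        # PR it no longer checks each task's state individually.
        self.trigger_tasks(open_slots=32)
        # sync() runs immediately afterwards and fetches task states in bulk,
        # which is why the per-task state check above is redundant.
        self.sync()

    def trigger_tasks(self, open_slots):
        ...  # placeholder

    def sync(self):
        ...  # placeholder
```
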
Some additional deployment details:
```
Airflow Version: 2.9.2
Python Version: 3.11
Platform: Amazon MWAA
Celery results backend: PostgreSQL
Celery broker: Amazon SQS
```
<!-- Please keep an empty line above the dashes. -->
---
**^ Add meaningful description above**
Read the **[Pull Request
Guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#pull-request-guidelines)**
for more information.
In case of fundamental code changes, an Airflow Improvement Proposal
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvement+Proposals))
is needed.
In case of a new dependency, check compliance with the [ASF 3rd Party
License Policy](https://www.apache.org/legal/resolved.html#category-x).
In case of backwards incompatible changes please leave a note in a
newsfragment file, named `{pr_number}.significant.rst` or
`{issue_number}.significant.rst`, in
[newsfragments](https://github.com/apache/airflow/tree/main/newsfragments).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]