[
https://issues.apache.org/jira/browse/AIRFLOW-5931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16979356#comment-16979356
]
ASF GitHub Bot commented on AIRFLOW-5931:
-----------------------------------------
ashb commented on pull request #6627: [AIRFLOW-5931] Use os.fork when
appropriate to speed up task execution.
URL: https://github.com/apache/airflow/pull/6627
Make sure you have checked _all_ steps below.
### Jira
- [x] https://issues.apache.org/jira/browse/AIRFLOW-5931
### Description
- [x] Rather than running a fresh python interpreter which then has to
re-load all of Airflow and its dependencies, we should use `os.fork` when
it is available/suitable, which should speed up task running, especially
for short-lived tasks.
I've profiled this and it took the task duration (as measured by the
`duration` column in the TI table) from an average of 14.063s down to
just 0.932s!
I _could_ take this further and bypass `CLIFactory` to call `_run_raw_task`
directly, but this PR keeps the change to the minimum needed to work.
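To illustrate the idea in the description, here is a minimal sketch contrasting the two ways of starting a task. It is not the code from this PR: `run_in_subprocess` and `run_in_fork` are hypothetical helper names, and the sketch assumes a POSIX platform where `os.fork` exists.

```python
import os
import subprocess
import sys


def run_in_subprocess(code: str) -> int:
    """Baseline: start a fresh interpreter, which must re-import everything."""
    return subprocess.call([sys.executable, "-c", code])


def run_in_fork(func) -> int:
    """Run func in a forked child and return the child's exit code.

    Hypothetical helper for illustration only.
    """
    pid = os.fork()
    if pid == 0:
        # Child: already-imported modules are inherited from the parent,
        # so there is no interpreter start-up or re-import cost.
        exit_code = 0
        try:
            func()
        except BaseException:
            exit_code = 1
        finally:
            # Leave immediately so the child never returns into the
            # parent's call stack. Note: atexit callbacks are skipped.
            os._exit(exit_code)
    # Parent: wait for the child and translate its wait status.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)
```

A caller would choose between the two with a check such as `hasattr(os, "fork")`, falling back to the subprocess path where forking is unavailable.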
### Tests
- [x] No unit tests added. Hopefully the existing tests are good enough.
Manual testing shows this working.
Other tests I need to perform:
- [ ] Check if `os._exit` is right (this doesn't run atexit callbacks), so
I need to check if logging in the subprocess is tidied up properly.
- [ ] Test if this leaves "dangling"/broken DB connections.
- [ ] Check remote log uploading
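The `os._exit` concern in the list above can be reproduced with a small sketch. It assumes a POSIX platform and uses a buffering `logging.handlers.MemoryHandler` as a stand-in for any handler that only writes on close; `log_in_child` and the `fork_demo` logger name are hypothetical. Because `logging.shutdown` is normally run through `atexit`, a child that leaves via `os._exit` has to call it explicitly or buffered records are lost.

```python
import logging
import logging.handlers
import os
import tempfile


def log_in_child(flush_before_exit: bool) -> str:
    """Fork a child that logs one buffered record, then leaves via os._exit.

    Returns whatever actually reached the log file.
    """
    fd, path = tempfile.mkstemp(suffix=".log")
    os.close(fd)
    pid = os.fork()
    if pid == 0:
        exit_code = 0
        try:
            target = logging.FileHandler(path)
            # MemoryHandler holds records until it is flushed or closed.
            buffered = logging.handlers.MemoryHandler(capacity=100, target=target)
            logger = logging.getLogger("fork_demo")  # hypothetical name
            logger.setLevel(logging.INFO)
            logger.propagate = False
            logger.addHandler(buffered)
            logger.warning("task finished")
            if flush_before_exit:
                # Do by hand what the skipped atexit hook would have done.
                logging.shutdown()
        except BaseException:
            exit_code = 1
        finally:
            os._exit(exit_code)
    os.waitpid(pid, 0)
    try:
        with open(path) as f:
            return f.read()
    finally:
        os.remove(path)
```

The same reasoning would apply to any other cleanup registered with `atexit`, which is why the checklist items about DB connections and remote log uploading need separate verification.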
### Commits
- [x] My commits all reference Jira issues in their subject lines, and I
have squashed multiple commits if they address the same issue. In addition, my
commits follow the guidelines from "[How to write a good git commit
message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
1. Subject is limited to 50 characters (not including Jira issue reference)
1. Subject does not end with a period
1. Subject uses the imperative mood ("add", not "adding")
1. Body wraps at 72 characters
1. Body explains "what" and "why", not "how"
### Documentation
- [x] In case of new functionality, my PR adds documentation that describes
how to use it.
- All the public functions and classes in the PR contain docstrings
that explain what they do
- If you implement backwards incompatible changes, please leave a note in
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so
we can assign it to an appropriate release
> Spawning new python interpreter for every task slow
> ---------------------------------------------------
>
> Key: AIRFLOW-5931
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5931
> Project: Apache Airflow
> Issue Type: Improvement
> Components: executors, worker
> Affects Versions: 2.0.0
> Reporter: Ash Berlin-Taylor
> Assignee: Ash Berlin-Taylor
> Priority: Major
>
> There are a number of places in the Executors and Task Runners where we spawn
> a whole new python interpreter.
> My profiling has shown that this is slow. Rather than running a fresh python
> interpreter which then has to re-load all of Airflow and its dependencies, we
> should use {{os.fork}} when it is available/suitable, which should speed up
> task running, especially for short-lived tasks.