potiuk commented on PR #35097:
URL: https://github.com/apache/airflow/pull/35097#issuecomment-1773761513
I like the idea of mentioning and being explicit about expensive APIs, but I
think there is value in mentioning imports too, because not many people are
aware of how big an impact such imports can have.
I believe `numpy` was indeed not a good example. It does import slowly the
first time, but mostly because it has to load a lot of C (.so) libraries into
memory. numpy is a thin wrapper around mostly C code, so once those .so
libraries are loaded, the import is fast. Generally anything < 0.2 s seems
instantaneous (and numpy imports faster than that).
I suggest adding the "expensive operation" example as well, but also keeping
the "slow import" example, replacing numpy with `pandas` to show how much
bigger the effect can be. Pandas is written mostly in Python (and uses numpy
under the hood, among others) and is notorious for being slow to import, as it
imports ~700 Python files (and that is after the `__pycache__` `.pyc` bytecode
files have been generated and the numpy shared .so libraries loaded into memory).
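For illustration, here is a minimal sketch of the usual remedy, assuming the concern is module-level code that runs every time the file is parsed. The function name `build_report` is hypothetical, and `statistics` is only a stand-in for a heavy dependency such as pandas:

```python
def build_report(values):
    """Compute a summary value, importing the heavy dependency only when called."""
    # Deferred import: the cost is paid when the function runs,
    # not when the module containing it is imported or parsed.
    import statistics  # stand-in for a slow import such as pandas

    return statistics.mean(values)


if __name__ == "__main__":
    print(build_report([1, 2, 3]))
```

The trade-off is that the first call to the function pays the import cost instead of the file parse.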
Some experiments:
It takes some 0.3 - 0.4 s to import on my macOS machine:
```
python -c 'import pandas' 0.46s user 1.91s system 649% cpu 0.364 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas' 0.65s user 1.73s system 647% cpu 0.367 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas' 0.72s user 1.46s system 658% cpu 0.331 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas' 0.45s user 1.69s system 628% cpu 0.341 total
[jarek:~] [airflow-3.11] % time python -c 'import pandas'
python -c 'import pandas' 1.08s user 1.34s system 562% cpu 0.430 total
```
So ~ 0.3 - 0.4 s wall-clock on my macOS machine.
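The same measurement can be repeated without a shell `time` builtin. This is a rough sketch, not a proper benchmark: the helper name `time_fresh_import` is made up here, and like `time python -c ...` it includes interpreter startup in every sample.

```python
import statistics
import subprocess
import sys
import time


def time_fresh_import(module: str, runs: int = 5) -> list[float]:
    """Return wall-clock seconds for `import <module>` in a fresh interpreter, one per run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # A new process per run, so nothing is cached in sys.modules between samples.
        subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # "json" is only a placeholder; substitute "pandas" or "numpy" to compare.
    samples = time_fresh_import("json")
    print(f"median {statistics.median(samples):.3f} s over {len(samples)} runs")
```

Running it for `pandas` and `numpy` in the same environment gives numbers comparable to the `real` / `total` columns in the shell output.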
And ~ 0.3 s in my Docker container:
```
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
real 0m0.323s
user 0m0.781s
sys 0m0.066s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
real 0m0.334s
user 0m0.780s
sys 0m0.079s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
real 0m0.291s
user 0m0.760s
sys 0m0.056s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
real 0m0.291s
user 0m0.742s
sys 0m0.075s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import pandas'
real 0m0.284s
user 0m0.744s
sys 0m0.057s
```
Importing pandas results in around 750 `openat` calls (files and directories):
```
strace python -c 'import pandas' 2>&1 | grep openat | wc
750 3972 96013
```
The same exercise for numpy shows that it is much faster in the container
(~0.1 s) and opens far fewer files:
```
root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
real 0m0.105s
user 0m0.342s
sys 0m0.028s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
real 0m0.141s
user 0m0.571s
sys 0m0.026s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
real 0m0.126s
user 0m0.352s
sys 0m0.038s
root@cbc7d85dfa99:/opt/airflow# time python -c 'import numpy'
real 0m0.122s
user 0m0.341s
sys 0m0.044s
```
Opened files:
```
strace python -c 'import numpy' 2>&1 | grep openat | wc
291 1593 35597
```
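Where `strace` is not available (for example on macOS), a rough proxy is to count the modules an import adds to `sys.modules`. This is a sketch under that assumption; the helper name `count_new_modules` is hypothetical, and the result will not match the `openat` counts, which also include directories, shared libraries and failed lookups.

```python
import subprocess
import sys

# Executed in the child interpreter: snapshot sys.modules, import the target,
# then print how many module entries were added.
_CHILD_CODE = (
    "import sys, importlib\n"
    "before = set(sys.modules)\n"
    "importlib.import_module(sys.argv[1])\n"
    "print(len(set(sys.modules) - before))\n"
)


def count_new_modules(module: str) -> int:
    """Count the modules that `import <module>` adds in a fresh interpreter."""
    result = subprocess.run(
        [sys.executable, "-c", _CHILD_CODE, module],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    # "json" is a placeholder; compare "pandas" against "numpy" instead.
    print(count_new_modules("json"))
```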
Fragment of the strace output for pandas, showing that it imports a lot of code:
```
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation/__pycache__/expressions.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/computation/__pycache__/check.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/missing.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/dispatch.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/invalid.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/common.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/docstrings.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/ops/__pycache__/mask_ops.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pandas/core/arrays/__pycache__/_arrow_string_mixins.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pyarrow/__pycache__/compute.cpython-311.pyc", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/local/lib/python3.11/site-packages/pyarrow/_compute.cpython-311-aarch64-linux-gnu.so", O_RDONLY|O_CLOEXEC) = 3
```
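A per-module breakdown is also available from CPython itself through the `-X importtime` option, which writes one line per imported module to stderr. The sketch below parses that output; the helper name `slowest_imports` is made up for illustration, and the parsing assumes the `import time: self [us] | cumulative | imported package` layout that CPython prints.

```python
import subprocess
import sys


def slowest_imports(module: str, top: int = 10) -> list[tuple[int, str]]:
    """Return (cumulative_microseconds, module_name) pairs, slowest first."""
    result = subprocess.run(
        [sys.executable, "-X", "importtime", "-c", f"import {module}"],
        capture_output=True,
        text=True,
        check=True,
    )
    prefix = "import time:"
    rows = []
    for line in result.stderr.splitlines():
        if not line.startswith(prefix):
            continue
        parts = line[len(prefix):].split("|")
        if len(parts) != 3:
            continue
        try:
            cumulative = int(parts[1])
        except ValueError:
            continue  # skip the column-header line
        rows.append((cumulative, parts[2].strip()))
    return sorted(rows, reverse=True)[:top]


if __name__ == "__main__":
    # "json" is a placeholder; use "pandas" to see which submodules dominate.
    for micros, name in slowest_imports("json"):
        print(f"{micros:>8} us  {name}")
```

Cumulative times include nested imports, so a top-level package will normally appear first.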