t4n1o commented on issue #19192:
URL: https://github.com/apache/airflow/issues/19192#issuecomment-1003639114
Well, there is nothing locking the db. This issue occurs even if there is no
database being used by my application.
Types of tasks that cause this problem:
- download all the archives from public.bitmex.com via a Python script
(internet speed is the bottleneck; takes about 3 hours on the first run)
- decompress csv.gz files into csv files (disk speed is the bottleneck;
takes about 4 hours on the first run)
- read the csv records for each day and transform them with a custom Rust
parsing tool
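
To make the second task concrete, here is a minimal sketch of the
decompression step using only the Python standard library. The function name
and the directory layout are assumptions for illustration, not the actual
script used here.

```python
import gzip
import shutil
from pathlib import Path


def decompress_csv_gz(src_dir, dst_dir):
    """Decompress every *.csv.gz in src_dir into a .csv file in dst_dir.

    Hypothetical helper: names and layout are illustrative only.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for archive in sorted(src_dir.glob("*.csv.gz")):
        # "2020.csv.gz" -> "2020.csv"
        target = dst_dir / archive.name[: -len(".gz")]
        # Stream in chunks so large archives are never fully loaded in memory.
        with gzip.open(archive, "rb") as fin, open(target, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        written.append(target)
    return written
```

Because the copy is streamed, the run time of a step like this is dominated
by disk throughput, which matches the bottleneck described above.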
Any program written in Rust or Python that takes a long time to execute will
cause this problem. We are using Airflow because, once all the historical
data is synced, the tasks run once per day for each new day.

Here is a dump of what the scheduler is doing while it's stuck.

State of the various Airflow processes:

I am starting the Rust/Python programs in a separate process with
BashOperator, and it's stuck on _recv(). Since every task is limited by an
API rate limit, disk speed, or network speed, it would be better if Airflow
could actually run more than one task at a time. Any ideas?
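
As an illustration of the point about I/O-bound work, and not of Airflow
itself, here is a minimal sketch that runs independent external commands
concurrently with the standard library. The commands are hypothetical
stand-ins for the real programs. Inside Airflow, how many tasks run at once
depends on the configured executor and its parallelism settings, which this
sketch does not touch.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_command(args):
    """Run one external command and return (exit code, captured stdout)."""
    # capture_output reads the child's stdout/stderr pipes while waiting,
    # so the child does not block on a full pipe buffer.
    result = subprocess.run(args, capture_output=True, text=True)
    return result.returncode, result.stdout


def run_concurrently(commands, max_workers=4):
    """Run several commands at once; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_command, commands))


# Hypothetical stand-ins for two independent long-running programs.
commands = [
    [sys.executable, "-c", "print('job A done')"],
    [sys.executable, "-c", "print('job B done')"],
]
results = run_concurrently(commands)
```

Threads are enough here because each worker spends its time waiting on a
child process rather than using the CPU.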
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.