I think you had the finger on it. If you have a frequent query against a large-ish table that cannot leverage an index, that will result in a lot of workload.
If I was in your shoes I'd run a CREATE INDEX statement against that table/field and see how it reduces your resource consumptions and make a mental note to babysit the db migration when upgrading the 1.8.0. Most likely that migration will fail, and you'll have to skip it manually by issuing an `alembic stamp` command. Max On Wed, Mar 8, 2017 at 3:25 PM, Jason Chen <[email protected]> wrote: > Thanks for the reply. > > We are using 1.7.1.3 and it looks the index is not there. > https://github.com/apache/incubator-airflow/blob/1.7.1. > 3/airflow/models.py#L660-#L664 > > Is Airflow 1.8 officially released ? I saw the version tag and discussion, > but not saw it in pypi.. > > I did run Dan's SQL statement to clean up table "task_instance" for entries > older than 30 days. > It seems not help on the CPU load. > We are using AWS RDS and not able to ssh to top the processes. > "show processlist;" from MySQL not help much. > > In general, what scenario will cause the Airflow MySQL CPU highly loaded ? > We have about 50 Dags (each one has 3-4 tasks) with high frequencies (20 > mins to one hour). Perhaps because of that ? > > Thanks. > -Jason > > On Wed, Mar 8, 2017 at 2:55 PM, Maxime Beauchemin < > [email protected]> wrote: > > > Wait. That field does have an index and it looks like Dan added it 8 > months > > ago. > > https://github.com/apache/incubator-airflow/blame/ > > master/airflow/models.py#L744 > > > > Here's the related DB migration script: > > https://github.com/apache/incubator-airflow/blob/master/ > > airflow/migrations/versions/211e584da130_add_ti_state_index.py > > > > Maybe it didn't make the previous release, it should be there in 1.8.0 if > > it wasn't in the release you are runing. > > > > You can go ahead and create the index manually as well, though that may > > break the migration script as you upgrade if you do. > > > > Max > > > > On Tue, Mar 7, 2017 at 9:57 PM, Dan Davydov <[email protected]. > > invalid> > > wrote: > > > > > We will need to come up with a plan soon (better DB indexes and/or the > > > ability to rotate out old task instances according to some policy). > > Nothing > > > concrete as of yet though. > > > > > > On Tue, Mar 7, 2017 at 6:18 PM, Jason Chen <[email protected]> > > > wrote: > > > > > > > Hi Dan, > > > > > > > > Thanks so much. This is exactly what I am looking for. > > > > > > > > Is there a plan on the future airflow road map to clean this up from > > > > Airflow system level? Say, in airflow.cfg, a setting to clean up data > > > older > > > > than specified time. > > > > > > > > Your solution is to run an airflow job to clean up the data. That's > > > great. > > > > In a short term for us, I will be just running the SQL command > directly > > > > from MySQL CLI and then setup an airflow job to do that periodically. > > > > > > > > Thanks. > > > > -Jason > > > > > > > > On Tue, Mar 7, 2017 at 5:47 PM, Dan Davydov <[email protected]. > > > > invalid> > > > > wrote: > > > > > > > > > FWIW we use the following DAG at Airbnb to reap the task instances > > > table > > > > > (this is a stopgap): > > > > > > > > > > # DAG to delete old TIs so that UI operations on the webserver are > > > fast. > > > > > This DAG is a > > > > > # stopgap, ideally we would make the UI not query all task > instances > > > and > > > > > add indexes to > > > > > # the task_instance table where appropriate to speed up the > remaining > > > > > webserver table > > > > > # queries. > > > > > # Note that there is a slight risk that some of these deleted task > > > > > instances may break > > > > > # the depends_on_past dependency for the following tasks but this > > > should > > > > > rarely happy > > > > > # and is easy to diagnose and fix. > > > > > > > > > > from datetime import datetime > > > > > > > > > > from airflow import DAG > > > > > from airflow.operators import MySqlOperator > > > > > > > > > > args = { > > > > > 'owner': 'xxx', > > > > > 'email': ['xxx'], > > > > > 'start_date': datetime(2017, 1, 30), > > > > > 'mysql_conn_id': 'airflow_db', > > > > > } > > > > > > > > > > dag = DAG( > > > > > 'airflow_old_task_instance_pruning', > > > > > default_args=args, > > > > > ) > > > > > > > > > > # TODO: TIs that have are successful without a start date will > never > > be > > > > > # reaped because they have been mark-success'd in the UI. One fix > for > > > > this > > > > > would be to > > > > > # make airflow set start_date when mark-success-ing. > > > > > sql = """\ > > > > > DELETE ti FROM task_instance ti > > > > > LEFT OUTER JOIN dag_run dr > > > > > ON ti.execution_date = dr.execution_date AND > > > > > ti.dag_id = dr.dag_id > > > > > WHERE ((ti.start_date <= DATE_SUB(NOW(), INTERVAL 30 DAY) > AND > > > > > ti.state != "running") OR > > > > > (ISNULL(ti.start_date) AND > > > > > ti.state = "failed")) AND > > > > > (ISNULL(dr.id) OR dr.state != "running") > > > > > """ > > > > > MySqlOperator( > > > > > task_id='delete_old_tis', > > > > > sql=sql, > > > > > dag=dag, > > > > > ) > > > > > > > > > > > > > > > > > > > > On Tue, Mar 7, 2017 at 5:39 PM, Jason Chen < > > [email protected]> > > > > > wrote: > > > > > > > > > > > Hi Bolke, > > > > > > > > > > > > Thanks, but it looks you are actually talking about Harish's use > > > case. > > > > > > > > > > > > My use case is about 50 Dags (each one with about 2-3 tasks). I > > feel > > > > our > > > > > > run interval setting for the dags are too low (~15 mins). It may > > > result > > > > > in > > > > > > high CPU of MySQL. > > > > > > > > > > > > Meanwhile, I dig to MySQL and I noticed a frequently running SQL > > > > > statement > > > > > > as below. It's without proper index on column > task_instance.state. > > > > > > > > > > > > Shouldn't it index "state", given that there could be million of > > rows > > > > in > > > > > > task_instance? > > > > > > > > > > > > SQL Statement: > > > > > > "SELECT task_instance.task_id AS task_instance_task_id, > > > > > > task_instance.dag_id AS task_instance_dag_id,.... FROM > > task_instance > > > > > WHERE > > > > > > task_instance.state = 'queued'" > > > > > > > > > > > > Also, is there a possibility to clean some "unneeded" entries in > > the > > > > > tables > > > > > > (say, task_instance) ? I mean, for example, removing task states > > > older > > > > > > than 6 months? > > > > > > > > > > > > Feedback are welcome. > > > > > > > > > > > > Thanks. > > > > > > > > > > > > -Jason > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Mar 7, 2017 at 11:45 AM, Bolke de Bruin < > [email protected] > > > > > > > > wrote: > > > > > > > > > > > > > Hi Jason > > > > > > > > > > > > > > I think you need to back it up with more numbers. You assume > > that a > > > > > load > > > > > > > of 100% is bad and also that 16GB of mem is a lot. > > > > > > > > > > > > > > 30x25 = 750 tasks per hour = 12,5 tasks per minute. For every > > task > > > we > > > > > > > launch a couple of processes (at least 2) that do not share > > memory, > > > > > this > > > > > > is > > > > > > > to ensure tasks cannot hurt each other. Curl tasks are probably > > > > > launched > > > > > > by > > > > > > > using a BashOperator, which means another process. Curl is > itself > > > > > another > > > > > > > process. So 4 processes per task, that cannot share memory. > Curl > > > can > > > > > > cache > > > > > > > memory itself as well. You probably have peak times and longer > > > > running > > > > > > > tasks so it is not evenly spread, then it starts adding up > > quickly? > > > > > > > > > > > > > > Bolke. > > > > > > > > > > > > > > > > > > > > > > On 7 Mar 2017, at 19:41, Jason Chen < > [email protected] > > > > > > > > wrote: > > > > > > > > > > > > > > > > Hi Harish, > > > > > > > > Thanks for the fast response and feedback. > > > > > > > > Yeah, I want to see the fix or more discussion ! > > > > > > > > > > > > > > > > BTW, I assume that, given your 30 dags, airflow runs fine > after > > > > your > > > > > > > > increase of heartbeat ? > > > > > > > > The default is 5 secs. > > > > > > > > > > > > > > > > > > > > > > > > Thanks. > > > > > > > > Jason > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Mar 7, 2017 at 10:24 AM, harish singh < > > > > > > [email protected]> > > > > > > > > wrote: > > > > > > > > > > > > > > > >> I had seen a similar behavior, a year ago, when we were are > < > > 5 > > > > > Dags. > > > > > > > Even > > > > > > > >> then the cpu utilization was reaching 100%. > > > > > > > >> One way to deal with this is - You could play with > "heatbeat" > > > > > numbers > > > > > > > (i.e > > > > > > > >> increase heartbeat). > > > > > > > >> But then you are introducing more delay to start jobs that > are > > > > ready > > > > > > to > > > > > > > run > > > > > > > >> (ready to be queued -> queued -> run) > > > > > > > >> > > > > > > > >> Right now, we have more than 30 dags (each with ~ 20-25 > tasks) > > > > that > > > > > > runs > > > > > > > >> every hour. > > > > > > > >> We are giving airflow about 5-6 cores (which still seems > less > > > for > > > > > > > airflow). > > > > > > > >> Also, for so many tasks every hour, our mem consumption is > > over > > > > > 16G. > > > > > > > >> All our tasks are basically doing "curl". So 16G seems too > > high. > > > > > > > >> > > > > > > > >> Having said that, I remember reading somewhere that there > was > > a > > > > fix > > > > > > > coming > > > > > > > >> for this. > > > > > > > >> If not, I would definitely want to see more discussion on > > this. > > > > > > > >> > > > > > > > >> Thanks for opening this. I would love to hear on how people > > are > > > > > > working > > > > > > > >> around this. > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> > > > > > > > >> On Tue, Mar 7, 2017 at 9:42 AM, Jason Chen < > > > > > [email protected] > > > > > > > > > > > > > > >> wrote: > > > > > > > >> > > > > > > > >>> Hi team, > > > > > > > >>> > > > > > > > >>> We are using airflow v1.7.1.3 and schedule about 50 dags > > (each > > > > dags > > > > > > is > > > > > > > >>> about 10 to one hour intervals). It's with LocalExecutor. > > > > > > > >>> > > > > > > > >>> Recently, we noticed the RDS (MySQL 5.6.x with AWS) runs > with > > > > ~100% > > > > > > > CPU. > > > > > > > >>> I am wondering if airflow scheduler and webserver can cause > > > high > > > > > CPU > > > > > > > load > > > > > > > >>> of MySQL, given ~50 dags? > > > > > > > >>> I feel MySQL should be light load.. > > > > > > > >>> > > > > > > > >>> Thanks. > > > > > > > >>> -Jason > > > > > > > >>> > > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
