With this patch:

$ git diff
diff --git a/airflow/jobs.py b/airflow/jobs.py
index f1de333..9d08e75 100644
--- a/airflow/jobs.py
+++ b/airflow/jobs.py
@@ -544,6 +544,7 @@ class SchedulerJob(BaseJob):
             .query(
                 TI.task_id,
                 func.max(TI.execution_date).label('max_ti'))
+            .with_hint(TI, 'USE INDEX (PRIMARY)')
             .filter(TI.dag_id == dag.dag_id)
             .filter(TI.state == State.SUCCESS)
             .filter(TI.task_id.in_(dag.task_ids))
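Since the USE INDEX hint in the patch above is MySQL-specific syntax, one way to keep it from leaking into other backends could be SQLAlchemy's dialect_name argument to with_hint, which renders the hint only when compiling for the named dialect. A minimal sketch, assuming the Airflow 1.8-era module layout; the session setup, 'dag1', and the task_id list are illustrative placeholders rather than part of the actual patch:

from sqlalchemy import func

from airflow import settings
from airflow.models import TaskInstance as TI
from airflow.utils.state import State

session = settings.Session()

# dialect_name='mysql' attaches the hint only when the query is
# compiled for MySQL; other backends render the query without it.
# 'dag1' and the task_id list are placeholder values.
latest_ti = (
    session
    .query(TI.task_id, func.max(TI.execution_date).label('max_ti'))
    .with_hint(TI, 'USE INDEX (PRIMARY)', dialect_name='mysql')
    .filter(TI.dag_id == 'dag1')
    .filter(TI.state == State.SUCCESS)
    .filter(TI.task_id.in_(['t1', 't2']))
    .group_by(TI.task_id)
)

That way a Postgres or SQLite metadata database would not choke on the MySQL hint syntax.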
The db.py file parses in 90s. About 10x faster. The slow_query log is
not showing any offending queries anymore. I will open a JIRA for
this. Note that the with_hint stuff I put in is MySQL specific, I
think. Anyone have any other ideas on how to achieve the equivalent
outcome?

On Mon, Jan 23, 2017 at 3:08 PM, Chris Riccomini <criccom...@apache.org> wrote:

> OK, it's using `state` instead of PRIMARY. Using PRIMARY with a hint,
> the query takes 0.47s. Without the hint, 10s. Going to try and patch.
>
> On Mon, Jan 23, 2017 at 2:57 PM, Chris Riccomini <criccom...@apache.org>
> wrote:
>
>> This inner query takes 10s:
>>
>> SELECT task_instance.task_id AS task_id,
>>        max(task_instance.execution_date) AS max_ti
>> FROM task_instance
>> WHERE task_instance.dag_id = 'dag1' AND task_instance.state = 'success'
>>   AND task_instance.task_id IN ('t1', 't2')
>> GROUP BY task_instance.task_id
>>
>> Explain seems OK:
>>
>> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
>> | id | select_type | table         | type | possible_keys                                      | key      | key_len | ref   | rows  | Extra                    |
>> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
>> |  1 | SIMPLE      | task_instance | ref  | PRIMARY,ti_dag_state,ti_pool,ti_state_lkp,ti_state | ti_state | 63      | const | 81898 | Using where; Using index |
>> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
>>
>> A look at the query indicates that it's using the state field as its
>> index lookup, which isn't good.
>>
>> On Mon, Jan 23, 2017 at 2:49 PM, Chris Riccomini <criccom...@apache.org>
>> wrote:
>>
>>> It's this query:
>>>
>>> SELECT task_instance.task_id AS task_instance_task_id,
>>> task_instance.dag_id AS task_instance_dag_id,
>>> task_instance.execution_date AS task_instance_execution_date,
>>> task_instance.start_date AS task_instance_start_date,
>>> task_instance.end_date AS task_instance_end_date,
>>> task_instance.duration AS task_instance_duration,
>>> task_instance.state AS task_instance_state,
>>> task_instance.try_number AS task_instance_try_number,
>>> task_instance.hostname AS task_instance_hostname,
>>> task_instance.unixname AS task_instance_unixname,
>>> task_instance.job_id AS task_instance_job_id,
>>> task_instance.pool AS task_instance_pool,
>>> task_instance.queue AS task_instance_queue,
>>> task_instance.priority_weight AS task_instance_priority_weight,
>>> task_instance.operator AS task_instance_operator,
>>> task_instance.queued_dttm AS task_instance_queued_dttm,
>>> task_instance.pid AS task_instance_pid
>>> FROM task_instance, (SELECT task_instance.task_id AS task_id,
>>> max(task_instance.execution_date) AS max_ti
>>> FROM task_instance
>>> WHERE task_instance.dag_id = 'some_dag_id' AND task_instance.state =
>>> 'success' AND task_instance.task_id IN ('t1', 't2', 't3', 't4', 't5',
>>> 't6') GROUP BY task_instance.task_id) AS sq
>>> WHERE task_instance.dag_id = 'some_dag_id' AND task_instance.task_id =
>>> sq.task_id AND task_instance.execution_date = sq.max_ti
>>>
>>> Taking 10-20s
>>>
>>> On Mon, Jan 23, 2017 at 2:05 PM, Chris Riccomini <criccom...@apache.org>
>>> wrote:
>>>
>>>> Can confirm it's a slow query on the task_instance table. Still digging.
>>>> Unfortunately, the query is truncated in my UI right now:
>>>>
>>>> SELECT task_instance.task_id AS task_instance_...
>>>>
>>>> On Mon, Jan 23, 2017 at 1:56 PM, Chris Riccomini <criccom...@apache.org>
>>>> wrote:
>>>>
>>>>> Digging. Might be a bit.
>>>>>
>>>>> On Mon, Jan 23, 2017 at 1:32 PM, Bolke de Bruin <bdbr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Slow query log? Db load?
>>>>>>
>>>>>> B.
>>>>>>
>>>>>> Sent from my iPad
>>>>>>
>>>>>> > On 23 Jan 2017, at 21:59, Chris Riccomini <criccom...@apache.org>
>>>>>> > wrote:
>>>>>> >
>>>>>> > Note: 6.5 million TIs in the task_instance table.
>>>>>> >
>>>>>> > On Mon, Jan 23, 2017 at 12:58 PM, Chris Riccomini
>>>>>> > <criccom...@apache.org> wrote:
>>>>>> >
>>>>>> >> Hey Bolke,
>>>>>> >>
>>>>>> >> Re: system usage, it's pretty quiet: <5% CPU usage. Mem is almost
>>>>>> >> all free as well.
>>>>>> >>
>>>>>> >> I am thinking that this is DB related, given that it's pausing
>>>>>> >> when executing an update. Was looking at the update_state method
>>>>>> >> in models.py, which logs right before the 15s pause.
>>>>>> >>
>>>>>> >> Cheers,
>>>>>> >> Chris
>>>>>> >>
>>>>>> >> On Mon, Jan 23, 2017 at 12:51 PM, Bolke de Bruin
>>>>>> >> <bdbr...@gmail.com> wrote:
>>>>>> >>
>>>>>> >>> Hi Chris,
>>>>>> >>>
>>>>>> >>> What is the load on your machine? (CPU/IO/MEM) It seems that the
>>>>>> >>> executor is faster in checking the state than the TaskInstance is
>>>>>> >>> able to exit itself. No, I don't consider it normal, but it has
>>>>>> >>> sometimes been reported
>>>>>> >>> (https://github.com/apache/incubator-airflow/pull/1821), though
>>>>>> >>> not really replicable as of yet.
>>>>>> >>>
>>>>>> >>> Parsing seems exceptionally slow; it might be worth looking at it
>>>>>> >>> with a debugger. A very faint guess is that something in the
>>>>>> >>> multiprocessing part does something with memory that is costly,
>>>>>> >>> but then we need to know more about what is running on the
>>>>>> >>> system. Some clue from system metrics could be helpful here.
>>>>>> >>>
>>>>>> >>> Bolke
>>>>>> >>>
>>>>>> >>>> On 23 Jan 2017, at 21:34, Chris Riccomini <criccom...@apache.org>
>>>>>> >>>> wrote:
>>>>>> >>>>
>>>>>> >>>> Also, seeing this in EVERY task that runs:
>>>>>> >>>>
>>>>>> >>>> [2017-01-23 20:26:13,777] {jobs.py:2112} WARNING - State of this
>>>>>> >>>> instance has been externally set to queued. Taking the poison
>>>>>> >>>> pill. So long.
>>>>>> >>>> [2017-01-23 20:26:13,841] {jobs.py:2051} INFO - Task exited with
>>>>>> >>>> return code 0
>>>>>> >>>>
>>>>>> >>>> All successful tasks are showing this at the end of their logs.
>>>>>> >>>> Is this normal?
>>>>>> >>>>
>>>>>> >>>> On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini
>>>>>> >>>> <criccom...@apache.org> wrote:
>>>>>> >>>>
>>>>>> >>>>> Hey all,
>>>>>> >>>>>
>>>>>> >>>>> I've upgraded on production. Things seem to be working so far
>>>>>> >>>>> (only been an hour), but I am seeing this in the scheduler logs:
>>>>>> >>>>>
>>>>>> >>>>> File Path                            PID    Runtime  Last Runtime  Last Run
>>>>>> >>>>> -----------------------------------  -----  -------  ------------  -------------------
>>>>>> >>>>> ...
>>>>>> >>>>> /etc/airflow/dags/dags/elt/el/db.py  24793  43.41s   986.63s       2017-01-23T20:04:09
>>>>>> >>>>> ...
>>>>>> >>>>>
>>>>>> >>>>> It seems to be taking more than 15 minutes to parse this DAG.
>>>>>> >>>>> Any idea what's causing this? Scheduler config:
>>>>>> >>>>>
>>>>>> >>>>> [scheduler]
>>>>>> >>>>> job_heartbeat_sec = 5
>>>>>> >>>>> scheduler_heartbeat_sec = 5
>>>>>> >>>>> max_threads = 2
>>>>>> >>>>> child_process_log_directory = /var/log/airflow/scheduler
>>>>>> >>>>>
>>>>>> >>>>> The db.py file, itself, doesn't interact with any outside
>>>>>> >>>>> systems, so I would have expected this to not take so long. It
>>>>>> >>>>> does, however, programmatically generate many DAGs within the
>>>>>> >>>>> single .py file.
>>>>>> >>>>>
>>>>>> >>>>> A snippet of the scheduler log is here:
>>>>>> >>>>>
>>>>>> >>>>> https://gist.github.com/criccomini/a2b2762763c8ba65fefcdd669e8ffd65
>>>>>> >>>>>
>>>>>> >>>>> Note how there are 10-15 second gaps occasionally. Any idea
>>>>>> >>>>> what's going on?
>>>>>> >>>>>
>>>>>> >>>>> Cheers,
>>>>>> >>>>> Chris
>>>>>> >>>>>
>>>>>> >>>>> On Sun, Jan 22, 2017 at 4:42 AM, Bolke de Bruin
>>>>>> >>>>> <bdbr...@gmail.com> wrote:
>>>>>> >>>>>
>>>>>> >>>>>> I created:
>>>>>> >>>>>>
>>>>>> >>>>>> - AIRFLOW-791: At start up all running dag_runs are being
>>>>>> >>>>>> checked, but not fixed
>>>>>> >>>>>> - AIRFLOW-790: DagRuns do not exist for certain tasks, but
>>>>>> >>>>>> don't get fixed
>>>>>> >>>>>> - AIRFLOW-788: Context unexpectedly added to hive conf
>>>>>> >>>>>> - AIRFLOW-792: Allow fixing of schedule when wrong start_date /
>>>>>> >>>>>> interval was specified
>>>>>> >>>>>>
>>>>>> >>>>>> I created AIRFLOW-789 to update UPDATING.md; it is also out as
>>>>>> >>>>>> a PR.
>>>>>> >>>>>>
>>>>>> >>>>>> Please note that I don't consider any of these blockers for a
>>>>>> >>>>>> release of 1.8.0; they can be fixed in 1.8.1, so we are still
>>>>>> >>>>>> on track for an RC on Feb 2. However, if people are using a
>>>>>> >>>>>> restarting scheduler (run_duration is set) and have a lot of
>>>>>> >>>>>> running dag runs, they won't like AIRFLOW-791. So a workaround
>>>>>> >>>>>> for this would be nice (we just updated dag_runs directly in
>>>>>> >>>>>> the database to say 'finished' before a certain date, but we
>>>>>> >>>>>> are also not using the run_duration; see the sketch at the end
>>>>>> >>>>>> of this thread).
>>>>>> >>>>>>
>>>>>> >>>>>> Bolke
>>>>>> >>>>>>
>>>>>> >>>>>>> On 20 Jan 2017, at 23:55, Bolke de Bruin <bdbr...@gmail.com>
>>>>>> >>>>>>> wrote:
>>>>>> >>>>>>>
>>>>>> >>>>>>> Will do. And thanks.
>>>>>> >>>>>>>
>>>>>> >>>>>>> Adding another issue:
>>>>>> >>>>>>>
>>>>>> >>>>>>> * Some of our DAGs are not getting scheduled for some unknown
>>>>>> >>>>>>> reason. Need to investigate why.
>>>>>> >>>>>>>
>>>>>> >>>>>>> Related but not root cause:
>>>>>> >>>>>>> * Logging is so chatty that it gets really hard to find the
>>>>>> >>>>>>> real issue.
>>>>>> >>>>>>>
>>>>>> >>>>>>> Bolke.
>>>>>> >>>>>>>
>>>>>> >>>>>>>> On 20 Jan 2017, at 23:45, Dan Davydov
>>>>>> >>>>>>>> <dan.davy...@airbnb.com.INVALID> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> I'd be happy to lend a hand fixing these issues and hopefully
>>>>>> >>>>>>>> some others are too. Do you mind creating jiras for these
>>>>>> >>>>>>>> since you have the full context?
>>>>>> >>>>>>>> I have created a JIRA for (1) and have assigned it to
>>>>>> >>>>>>>> myself:
>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-780
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> On Fri, Jan 20, 2017 at 1:01 AM, Bolke de Bruin
>>>>>> >>>>>>>> <bdbr...@gmail.com> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>> This is to report back on some of the (early) experiences
>>>>>> >>>>>>>>> we have with Airflow 1.8.0 (beta 1 at the moment):
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> 1. The UI does not show a faulty DAG, leading to confusion
>>>>>> >>>>>>>>> for developers. When a faulty dag was placed in the dags
>>>>>> >>>>>>>>> folder, the UI used to report a parsing error. Now it
>>>>>> >>>>>>>>> doesn't, due to the separate parsing (which does not report
>>>>>> >>>>>>>>> errors back).
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> 2. The hive hook sets 'airflow.ctx.dag_id' in hive. We run
>>>>>> >>>>>>>>> in a secure environment which requires this variable to be
>>>>>> >>>>>>>>> whitelisted if it is modified (needs to be added to
>>>>>> >>>>>>>>> UPDATING.md).
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> 3. DagRuns do not exist for certain tasks, but don't get
>>>>>> >>>>>>>>> fixed. The log gets flooded without a suggestion what to do.
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> 4. At start up all running dag_runs are being checked; we
>>>>>> >>>>>>>>> seemed to have a lot of "left over" dag_runs (couple of
>>>>>> >>>>>>>>> thousand).
>>>>>> >>>>>>>>> - Checking was logged to INFO -> requires an fsync for every
>>>>>> >>>>>>>>> log message, making it very slow
>>>>>> >>>>>>>>> - Checking would happen at every restart, but dag_runs'
>>>>>> >>>>>>>>> states were not being updated
>>>>>> >>>>>>>>> - These dag_runs would never be marked anything other than
>>>>>> >>>>>>>>> running for some reason
>>>>>> >>>>>>>>> -> Applied workaround to update all dag_runs in sql before a
>>>>>> >>>>>>>>> certain date to -> finished
>>>>>> >>>>>>>>> -> need to investigate why dag_runs did not get marked
>>>>>> >>>>>>>>> "finished/failed"
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> 5. Our umask is set to 027
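On the workaround Bolke describes above (updating old dag_runs directly in the database to 'finished'): a rough SQLAlchemy equivalent is sketched below. The cutoff date is a placeholder, and mapping 'finished' to State.SUCCESS is an assumption; the thread does not say which state value was actually written.

from datetime import datetime

from airflow import settings
from airflow.models import DagRun
from airflow.utils.state import State

session = settings.Session()

# Placeholder cutoff date; every dag_run still marked 'running' before
# it gets closed out so the scheduler stops re-checking it at startup.
# Using State.SUCCESS as the stand-in for 'finished' is an assumption.
CUTOFF = datetime(2017, 1, 1)

(session
 .query(DagRun)
 .filter(DagRun.state == State.RUNNING)
 .filter(DagRun.execution_date < CUTOFF)
 .update({DagRun.state: State.SUCCESS}, synchronize_session=False))
session.commit()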