It's this query:

    SELECT
        task_instance.task_id AS task_instance_task_id,
        task_instance.dag_id AS task_instance_dag_id,
        task_instance.execution_date AS task_instance_execution_date,
        task_instance.start_date AS task_instance_start_date,
        task_instance.end_date AS task_instance_end_date,
        task_instance.duration AS task_instance_duration,
        task_instance.state AS task_instance_state,
        task_instance.try_number AS task_instance_try_number,
        task_instance.hostname AS task_instance_hostname,
        task_instance.unixname AS task_instance_unixname,
        task_instance.job_id AS task_instance_job_id,
        task_instance.pool AS task_instance_pool,
        task_instance.queue AS task_instance_queue,
        task_instance.priority_weight AS task_instance_priority_weight,
        task_instance.operator AS task_instance_operator,
        task_instance.queued_dttm AS task_instance_queued_dttm,
        task_instance.pid AS task_instance_pid
    FROM task_instance,
        (SELECT
            task_instance.task_id AS task_id,
            max(task_instance.execution_date) AS max_ti
        FROM task_instance
        WHERE task_instance.dag_id = 'some_dag_id'
            AND task_instance.state = 'success'
            AND task_instance.task_id IN ('t1', 't2', 't3', 't4', 't5', 't6')
        GROUP BY task_instance.task_id) AS sq
    WHERE task_instance.dag_id = 'some_dag_id'
        AND task_instance.task_id = sq.task_id
        AND task_instance.execution_date = sq.max_ti

Taking 10-20s.
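For reference, a covering index would likely help a query shaped like this; here is a minimal sketch, assuming MySQL as the metadata DB and SQLAlchemy at hand (the index name and connection string are hypothetical):

    # A sketch, not a definitive fix: a composite index covering the
    # subquery's filter, GROUP BY, and MAX columns lets the aggregation be
    # answered from the index instead of scanning ~6.5M task_instance rows.
    from sqlalchemy import create_engine, text

    engine = create_engine("mysql://airflow:***@localhost/airflow")  # hypothetical DSN

    with engine.begin() as conn:
        conn.execute(text(
            "CREATE INDEX ti_dag_state_task_date "
            "ON task_instance (dag_id, state, task_id, execution_date)"
        ))

The outer join back on (dag_id, task_id, execution_date) can then be served by the table's primary key, so only the aggregating subquery needs the new index.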
On Mon, Jan 23, 2017 at 2:05 PM, Chris Riccomini <criccom...@apache.org> wrote:

Can confirm it's a slow query on the task_instance table. Still digging. Unfortunately, the query is truncated in my UI right now:

    SELECT task_instance.task_id AS task_instance_...

On Mon, Jan 23, 2017 at 1:56 PM, Chris Riccomini <criccom...@apache.org> wrote:

Digging. Might be a bit.

On Mon, Jan 23, 2017 at 1:32 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

Slow query log? DB load?

B.

Sent from my iPad

On 23 Jan 2017 at 21:59, Chris Riccomini <criccom...@apache.org> wrote:

Note: 6.5 million TIs in the task_instance table.

On Mon, Jan 23, 2017 at 12:58 PM, Chris Riccomini <criccom...@apache.org> wrote:

Hey Bolke,

Re: system usage, it's pretty quiet: <5% CPU usage, and memory is almost all free as well.

I am thinking that this is DB related, given that it's pausing when executing an update. I was looking at the update_state method in models.py, which logs right before the 15s pause.

Cheers,
Chris

On Mon, Jan 23, 2017 at 12:51 PM, Bolke de Bruin <bdbr...@gmail.com> wrote:

Hi Chris,

What is the load on your machine (CPU/IO/MEM)? It seems that the executor is faster in checking the state than the TaskInstance is able to exit itself. No, I don't consider it normal, but it has sometimes been reported (https://github.com/apache/incubator-airflow/pull/1821), though not really replicable as of yet.

Parsing seems exceptionally slow; it might be worth looking at it with a debugger. A very faint guess is that something in the multiprocessing part is doing something with memory that is costly, but then we need to know more about what is running on the system. Some system metrics could be helpful here.

Bolke

On 23 Jan 2017, at 21:34, Chris Riccomini <criccom...@apache.org> wrote:

Also, seeing this in EVERY task that runs:

    [2017-01-23 20:26:13,777] {jobs.py:2112} WARNING - State of this instance has been externally set to queued. Taking the poison pill. So long.
    [2017-01-23 20:26:13,841] {jobs.py:2051} INFO - Task exited with return code 0

All successful tasks are showing this at the end of their logs. Is this normal?

On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini <criccom...@apache.org> wrote:

Hey all,

I've upgraded on production. Things seem to be working so far (only been an hour), but I am seeing this in the scheduler logs:

    File Path                            PID    Runtime  Last Runtime  Last Run
    -----------------------------------  -----  -------  ------------  -------------------
    ...
    /etc/airflow/dags/dags/elt/el/db.py  24793  43.41s   986.63s       2017-01-23T20:04:09
    ...

It seems to be taking more than 15 minutes to parse this DAG. Any idea what's causing this?

Scheduler config:

    [scheduler]
    job_heartbeat_sec = 5
    scheduler_heartbeat_sec = 5
    max_threads = 2
    child_process_log_directory = /var/log/airflow/scheduler

The db.py file itself doesn't interact with any outside systems, so I would have expected this to not take so long. It does, however, programmatically generate many DAGs within the single .py file.

A snippet of the scheduler log is here:

https://gist.github.com/criccomini/a2b2762763c8ba65fefcdd669e8ffd65

Note how there are 10-15 second gaps occasionally. Any idea what's going on?

Cheers,
Chris
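To make the parsing discussion concrete, here is a minimal sketch of the many-DAGs-in-one-file pattern described above; the table list, DAG IDs, and dates are made up:

    # Illustrative only: one .py file that registers many DAGs with the
    # scheduler by assigning each DAG object to module globals().
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.dummy_operator import DummyOperator

    TABLES = ["t1", "t2", "t3"]  # hypothetical; in practice this can be large

    for table in TABLES:
        dag_id = "el_{}".format(table)
        dag = DAG(dag_id, start_date=datetime(2017, 1, 1),
                  schedule_interval="@daily")
        DummyOperator(task_id="load", dag=dag)
        # The parser only discovers DAGs found at module level.
        globals()[dag_id] = dag

If building the list of DAGs does anything expensive (DB or API calls, heavy imports), that cost is paid on every scheduler parse loop, which is one way to arrive at parse times like the 986.63s shown above.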
On Sun, Jan 22, 2017 at 4:42 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

I created:

- AIRFLOW-791: At start up all running dag_runs are being checked, but not fixed
- AIRFLOW-790: DagRuns do not exist for certain tasks, but don't get fixed
- AIRFLOW-788: Context unexpectedly added to hive conf
- AIRFLOW-792: Allow fixing of schedule when wrong start_date / interval was specified

I created AIRFLOW-789 to update UPDATING.md; it is also out as a PR.

Please note that I don't consider any of these blockers for a release of 1.8.0; they can be fixed in 1.8.1, so we are still on track for an RC on Feb 2. However, people who are using a restarting scheduler (run_duration is set) and have a lot of running dag runs won't like AIRFLOW-791. A workaround for this would be nice (we just updated dag_runs directly in the database to say 'finished' before a certain date, but we are also not using the run_duration).

Bolke

On 20 Jan 2017, at 23:55, Bolke de Bruin <bdbr...@gmail.com> wrote:

Will do. And thanks.

Adding another issue:

* Some of our DAGs are not getting scheduled for some unknown reason. Need to investigate why.

Related, but not the root cause:

* Logging is so chatty that it gets really hard to find the real issue.

Bolke.

On 20 Jan 2017, at 23:45, Dan Davydov <dan.davy...@airbnb.com.INVALID> wrote:

I'd be happy to lend a hand fixing these issues, and hopefully some others are too. Do you mind creating JIRAs for these, since you have the full context? I have created a JIRA for (1) and have assigned it to myself:

https://issues.apache.org/jira/browse/AIRFLOW-780
On Fri, Jan 20, 2017 at 1:01 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:

This is to report back on some of the (early) experiences we have with Airflow 1.8.0 (beta 1 at the moment):

1. The UI does not show a faulty DAG, leading to confusion for developers.
When a faulty DAG is placed in the dags folder, the UI used to report a parsing error. Now it doesn't, due to the separate parsing (which does not report errors back).

2. The hive hook sets 'airflow.ctx.dag_id' in hive.
We run in a secure environment, which requires this variable to be whitelisted if it is modified (needs to be added to UPDATING.md).

3. DagRuns do not exist for certain tasks, but don't get fixed.
The log gets flooded without a suggestion of what to do.

4. At start up all running dag_runs are being checked; we seemed to have a lot of "left over" dag_runs (a couple of thousand).
- Checking was logged at INFO level, which requires an fsync for every log message, making it very slow.
- Checking would happen at every restart, but the dag_runs' states were not being updated.
- These dag_runs would never be marked anything other than running, for some reason.
-> Applied a workaround to update all dag_runs before a certain date to finished, directly in SQL (see the sketch after this list).
-> Need to investigate why dag_runs did not get marked finished/failed.

5. Our umask is set to 027.
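The SQL workaround in item 4 could look roughly like the following sketch. The cutoff date is made up, and it assumes the stale runs should be set to 'success' (the thread says "finished", which is not a literal dag_run state in 1.8):

    # A sketch of the workaround described in item 4, not official tooling:
    # mark stale 'running' dag_runs before a cutoff as done, directly in the
    # metadata DB. This bypasses Airflow entirely, so back up the DB first.
    from sqlalchemy import create_engine, text

    engine = create_engine("mysql://airflow:***@localhost/airflow")  # hypothetical DSN

    with engine.begin() as conn:
        conn.execute(text(
            "UPDATE dag_run SET state = 'success' "
            "WHERE state = 'running' AND execution_date < :cutoff"
        ), cutoff="2017-01-01")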