With this patch:

$ git diff
diff --git a/airflow/jobs.py b/airflow/jobs.py
index f1de333..9d08e75 100644
--- a/airflow/jobs.py
+++ b/airflow/jobs.py
@@ -544,6 +544,7 @@ class SchedulerJob(BaseJob):
             .query(
                 TI.task_id,
                 func.max(TI.execution_date).label('max_ti'))
+            .with_hint(TI, 'USE INDEX (PRIMARY)')
             .filter(TI.dag_id == dag.dag_id)
             .filter(TI.state == State.SUCCESS)
             .filter(TI.task_id.in_(dag.task_ids))
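Since the USE INDEX hint in the patch above is MySQL-specific syntax, one way to keep it from leaking into other backends could be SQLAlchemy's dialect_name argument to with_hint, which renders the hint only when compiling for the named dialect. A minimal sketch, assuming the Airflow 1.8-era module layout; the session setup, 'dag1', and the task_id list are illustrative placeholders rather than part of the actual patch:

from sqlalchemy import func

from airflow import settings
from airflow.models import TaskInstance as TI
from airflow.utils.state import State

session = settings.Session()

# dialect_name='mysql' attaches the hint only when the query is
# compiled for MySQL; other backends render the query without it.
# 'dag1' and the task_id list are placeholder values.
latest_ti = (
    session
    .query(TI.task_id, func.max(TI.execution_date).label('max_ti'))
    .with_hint(TI, 'USE INDEX (PRIMARY)', dialect_name='mysql')
    .filter(TI.dag_id == 'dag1')
    .filter(TI.state == State.SUCCESS)
    .filter(TI.task_id.in_(['t1', 't2']))
    .group_by(TI.task_id)
)

That way a Postgres or SQLite metadata database would not choke on the MySQL hint syntax.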
The db.py file parses in 90s. About 10x faster. The slow_query log is
not showing any offending queries anymore. I will open a JIRA for
this. Note that the with_hint stuff I put in is MySQL specific, I
think. Anyone have any other ideas on how to achieve the equivalent
outcome?

On Mon, Jan 23, 2017 at 3:08 PM, Chris Riccomini <criccom...@apache.org> wrote:

> OK, it's using `state` instead of PRIMARY. Using PRIMARY with a hint,
> the query takes 0.47s. Without the hint, 10s. Going to try and patch.
>
> On Mon, Jan 23, 2017 at 2:57 PM, Chris Riccomini <criccom...@apache.org>
> wrote:
>
>> This inner query takes 10s:
>>
>> SELECT task_instance.task_id AS task_id,
>>        max(task_instance.execution_date) AS max_ti
>> FROM task_instance
>> WHERE task_instance.dag_id = 'dag1' AND task_instance.state = 'success'
>>   AND task_instance.task_id IN ('t1', 't2')
>> GROUP BY task_instance.task_id
>>
>> Explain seems OK:
>>
>> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
>> | id | select_type | table         | type | possible_keys                                      | key      | key_len | ref   | rows  | Extra                    |
>> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
>> |  1 | SIMPLE      | task_instance | ref  | PRIMARY,ti_dag_state,ti_pool,ti_state_lkp,ti_state | ti_state | 63      | const | 81898 | Using where; Using index |
>> +----+-------------+---------------+------+----------------------------------------------------+----------+---------+-------+-------+--------------------------+
>>
>> A look at the query indicates that it's using the state field as its
>> index lookup, which isn't good.
>>
>> On Mon, Jan 23, 2017 at 2:49 PM, Chris Riccomini <criccom...@apache.org>
>> wrote:
>>
>>> It's this query:
>>>
>>> SELECT task_instance.task_id AS task_instance_task_id,
>>> task_instance.dag_id AS task_instance_dag_id,
>>> task_instance.execution_date AS task_instance_execution_date,
>>> task_instance.start_date AS task_instance_start_date,
>>> task_instance.end_date AS task_instance_end_date,
>>> task_instance.duration AS task_instance_duration,
>>> task_instance.state AS task_instance_state,
>>> task_instance.try_number AS task_instance_try_number,
>>> task_instance.hostname AS task_instance_hostname,
>>> task_instance.unixname AS task_instance_unixname,
>>> task_instance.job_id AS task_instance_job_id,
>>> task_instance.pool AS task_instance_pool,
>>> task_instance.queue AS task_instance_queue,
>>> task_instance.priority_weight AS task_instance_priority_weight,
>>> task_instance.operator AS task_instance_operator,
>>> task_instance.queued_dttm AS task_instance_queued_dttm,
>>> task_instance.pid AS task_instance_pid
>>> FROM task_instance, (SELECT task_instance.task_id AS task_id,
>>> max(task_instance.execution_date) AS max_ti
>>> FROM task_instance
>>> WHERE task_instance.dag_id = 'some_dag_id' AND task_instance.state =
>>> 'success' AND task_instance.task_id IN ('t1', 't2', 't3', 't4', 't5',
>>> 't6') GROUP BY task_instance.task_id) AS sq
>>> WHERE task_instance.dag_id = 'some_dag_id' AND task_instance.task_id =
>>> sq.task_id AND task_instance.execution_date = sq.max_ti
>>>
>>> Taking 10-20s
>>>
>>> On Mon, Jan 23, 2017 at 2:05 PM, Chris Riccomini <criccom...@apache.org>
>>> wrote:
>>>
>>>> Can confirm it's a slow query on the task_instance table. Still digging.
>>>> Unfortunately, the query is truncated in my UI right now:
>>>>
>>>> SELECT task_instance.task_id AS task_instance_...
>>>>
>>>> On Mon, Jan 23, 2017 at 1:56 PM, Chris Riccomini <criccom...@apache.org>
>>>> wrote:
>>>>
>>>>> Digging. Might be a bit.
>>>>>
>>>>> On Mon, Jan 23, 2017 at 1:32 PM, Bolke de Bruin <bdbr...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Slow query log? Db load?
>>>>>>
>>>>>> B.
>>>>>>
>>>>>> Sent from my iPad
>>>>>>
>>>>>> > On 23 Jan 2017, at 21:59, Chris Riccomini <criccom...@apache.org>
>>>>>> > wrote:
>>>>>> >
>>>>>> > Note: 6.5 million TIs in the task_instance table.
>>>>>> >
>>>>>> > On Mon, Jan 23, 2017 at 12:58 PM, Chris Riccomini
>>>>>> > <criccom...@apache.org> wrote:
>>>>>> >
>>>>>> >> Hey Bolke,
>>>>>> >>
>>>>>> >> Re: system usage, it's pretty quiet: <5% CPU usage. Mem is almost
>>>>>> >> all free as well.
>>>>>> >>
>>>>>> >> I am thinking that this is DB related, given that it's pausing
>>>>>> >> when executing an update. Was looking at the update_state method
>>>>>> >> in models.py, which logs right before the 15s pause.
>>>>>> >>
>>>>>> >> Cheers,
>>>>>> >> Chris
>>>>>> >>
>>>>>> >> On Mon, Jan 23, 2017 at 12:51 PM, Bolke de Bruin
>>>>>> >> <bdbr...@gmail.com> wrote:
>>>>>> >>
>>>>>> >>> Hi Chris,
>>>>>> >>>
>>>>>> >>> What is the load on your machine? (CPU/IO/MEM) It seems that the
>>>>>> >>> executor is faster in checking the state than the TaskInstance is
>>>>>> >>> able to exit itself. No, I don't consider it normal, but it has
>>>>>> >>> sometimes been reported
>>>>>> >>> (https://github.com/apache/incubator-airflow/pull/1821), though
>>>>>> >>> not really replicable as of yet.
>>>>>> >>>
>>>>>> >>> Parsing seems exceptionally slow; it might be worth looking at it
>>>>>> >>> with a debugger. A very faint guess is that something in the
>>>>>> >>> multiprocessing part does something with memory that is costly,
>>>>>> >>> but then we need to know more about what is running on the
>>>>>> >>> system. Some clue from system metrics could be helpful here.
>>>>>> >>>
>>>>>> >>> Bolke
>>>>>> >>>
>>>>>> >>>> On 23 Jan 2017, at 21:34, Chris Riccomini <criccom...@apache.org>
>>>>>> >>>> wrote:
>>>>>> >>>>
>>>>>> >>>> Also, seeing this in EVERY task that runs:
>>>>>> >>>>
>>>>>> >>>> [2017-01-23 20:26:13,777] {jobs.py:2112} WARNING - State of this
>>>>>> >>>> instance has been externally set to queued. Taking the poison
>>>>>> >>>> pill. So long.
>>>>>> >>>> [2017-01-23 20:26:13,841] {jobs.py:2051} INFO - Task exited with
>>>>>> >>>> return code 0
>>>>>> >>>>
>>>>>> >>>> All successful tasks are showing this at the end of their logs.
>>>>>> >>>> Is this normal?
>>>>>> >>>>
>>>>>> >>>> On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini
>>>>>> >>>> <criccom...@apache.org> wrote:
>>>>>> >>>>
>>>>>> >>>>> Hey all,
>>>>>> >>>>>
>>>>>> >>>>> I've upgraded on production. Things seem to be working so far
>>>>>> >>>>> (only been an hour), but I am seeing this in the scheduler logs:
>>>>>> >>>>>
>>>>>> >>>>> File Path                            PID    Runtime  Last Runtime  Last Run
>>>>>> >>>>> -----------------------------------  -----  -------  ------------  -------------------
>>>>>> >>>>> ...
>>>>>> >>>>> /etc/airflow/dags/dags/elt/el/db.py  24793  43.41s   986.63s       2017-01-23T20:04:09
>>>>>> >>>>> ...
>>>>>> >>>>>
>>>>>> >>>>> It seems to be taking more than 15 minutes to parse this DAG.
>>>>>> >>>>> Any idea what's causing this? Scheduler config:
>>>>>> >>>>>
>>>>>> >>>>> [scheduler]
>>>>>> >>>>> job_heartbeat_sec = 5
>>>>>> >>>>> scheduler_heartbeat_sec = 5
>>>>>> >>>>> max_threads = 2
>>>>>> >>>>> child_process_log_directory = /var/log/airflow/scheduler
>>>>>> >>>>>
>>>>>> >>>>> The db.py file, itself, doesn't interact with any outside
>>>>>> >>>>> systems, so I would have expected this to not take so long. It
>>>>>> >>>>> does, however, programmatically generate many DAGs within the
>>>>>> >>>>> single .py file.
>>>>>> >>>>>
>>>>>> >>>>> A snippet of the scheduler log is here:
>>>>>> >>>>>
>>>>>> >>>>> https://gist.github.com/criccomini/a2b2762763c8ba65fefcdd669e8ffd65
>>>>>> >>>>>
>>>>>> >>>>> Note how there are 10-15 second gaps occasionally. Any idea
>>>>>> >>>>> what's going on?
>>>>>> >>>>>
>>>>>> >>>>> Cheers,
>>>>>> >>>>> Chris
>>>>>> >>>>>
>>>>>> >>>>> On Sun, Jan 22, 2017 at 4:42 AM, Bolke de Bruin
>>>>>> >>>>> <bdbr...@gmail.com> wrote:
>>>>>> >>>>>
>>>>>> >>>>>> I created:
>>>>>> >>>>>>
>>>>>> >>>>>> - AIRFLOW-791: At start up all running dag_runs are being
>>>>>> >>>>>> checked, but not fixed
>>>>>> >>>>>> - AIRFLOW-790: DagRuns do not exist for certain tasks, but
>>>>>> >>>>>> don't get fixed
>>>>>> >>>>>> - AIRFLOW-788: Context unexpectedly added to hive conf
>>>>>> >>>>>> - AIRFLOW-792: Allow fixing of schedule when wrong start_date /
>>>>>> >>>>>> interval was specified
>>>>>> >>>>>>
>>>>>> >>>>>> I created AIRFLOW-789 to update UPDATING.md; it is also out as
>>>>>> >>>>>> a PR.
>>>>>> >>>>>>
>>>>>> >>>>>> Please note that I don't consider any of these blockers for a
>>>>>> >>>>>> release of 1.8.0; they can be fixed in 1.8.1, so we are still
>>>>>> >>>>>> on track for an RC on Feb 2. However, if people are using a
>>>>>> >>>>>> restarting scheduler (run_duration is set) and have a lot of
>>>>>> >>>>>> running dag runs, they won't like AIRFLOW-791. So a workaround
>>>>>> >>>>>> for this would be nice (we just updated dag_runs directly in
>>>>>> >>>>>> the database to say 'finished' before a certain date, but we
>>>>>> >>>>>> are also not using the run_duration; see the sketch at the end
>>>>>> >>>>>> of this thread).
>>>>>> >>>>>>
>>>>>> >>>>>> Bolke
>>>>>> >>>>>>
>>>>>> >>>>>>> On 20 Jan 2017, at 23:55, Bolke de Bruin <bdbr...@gmail.com>
>>>>>> >>>>>>> wrote:
>>>>>> >>>>>>>
>>>>>> >>>>>>> Will do. And thanks.
>>>>>> >>>>>>>
>>>>>> >>>>>>> Adding another issue:
>>>>>> >>>>>>>
>>>>>> >>>>>>> * Some of our DAGs are not getting scheduled for some unknown
>>>>>> >>>>>>> reason. Need to investigate why.
>>>>>> >>>>>>>
>>>>>> >>>>>>> Related but not root cause:
>>>>>> >>>>>>> * Logging is so chatty that it gets really hard to find the
>>>>>> >>>>>>> real issue.
>>>>>> >>>>>>>
>>>>>> >>>>>>> Bolke.
>>>>>> >>>>>>>
>>>>>> >>>>>>>> On 20 Jan 2017, at 23:45, Dan Davydov
>>>>>> >>>>>>>> <dan.davy...@airbnb.com.INVALID> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> I'd be happy to lend a hand fixing these issues and hopefully
>>>>>> >>>>>>>> some others are too. Do you mind creating jiras for these
>>>>>> >>>>>>>> since you have the full context?
>>>>>> >>>>>>>> I have created a JIRA for (1) and have assigned it to
>>>>>> >>>>>>>> myself:
>>>>>> >>>>>>>> https://issues.apache.org/jira/browse/AIRFLOW-780
>>>>>> >>>>>>>>
>>>>>> >>>>>>>> On Fri, Jan 20, 2017 at 1:01 AM, Bolke de Bruin
>>>>>> >>>>>>>> <bdbr...@gmail.com> wrote:
>>>>>> >>>>>>>>
>>>>>> >>>>>>>>> This is to report back on some of the (early) experiences
>>>>>> >>>>>>>>> we have with Airflow 1.8.0 (beta 1 at the moment):
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> 1. The UI does not show a faulty DAG, leading to confusion
>>>>>> >>>>>>>>> for developers. When a faulty dag was placed in the dags
>>>>>> >>>>>>>>> folder, the UI used to report a parsing error. Now it
>>>>>> >>>>>>>>> doesn't, due to the separate parsing (which does not report
>>>>>> >>>>>>>>> errors back).
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> 2. The hive hook sets 'airflow.ctx.dag_id' in hive. We run
>>>>>> >>>>>>>>> in a secure environment which requires this variable to be
>>>>>> >>>>>>>>> whitelisted if it is modified (needs to be added to
>>>>>> >>>>>>>>> UPDATING.md).
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> 3. DagRuns do not exist for certain tasks, but don't get
>>>>>> >>>>>>>>> fixed. The log gets flooded without a suggestion what to do.
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> 4. At start up all running dag_runs are being checked; we
>>>>>> >>>>>>>>> seemed to have a lot of "left over" dag_runs (couple of
>>>>>> >>>>>>>>> thousand).
>>>>>> >>>>>>>>> - Checking was logged to INFO -> requires an fsync for every
>>>>>> >>>>>>>>> log message, making it very slow
>>>>>> >>>>>>>>> - Checking would happen at every restart, but dag_runs'
>>>>>> >>>>>>>>> states were not being updated
>>>>>> >>>>>>>>> - These dag_runs would never be marked anything other than
>>>>>> >>>>>>>>> running for some reason
>>>>>> >>>>>>>>> -> Applied workaround to update all dag_runs in sql before a
>>>>>> >>>>>>>>> certain date to -> finished
>>>>>> >>>>>>>>> -> need to investigate why dag_runs did not get marked
>>>>>> >>>>>>>>> "finished/failed"
>>>>>> >>>>>>>>>
>>>>>> >>>>>>>>> 5. Our umask is set to 027
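On the workaround Bolke describes above (updating old dag_runs directly in the database to 'finished'): a rough SQLAlchemy equivalent is sketched below. The cutoff date is a placeholder, and mapping 'finished' to State.SUCCESS is an assumption; the thread does not say which state value was actually written.

from datetime import datetime

from airflow import settings
from airflow.models import DagRun
from airflow.utils.state import State

session = settings.Session()

# Placeholder cutoff date; every dag_run still marked 'running' before
# it gets closed out so the scheduler stops re-checking it at startup.
# Using State.SUCCESS as the stand-in for 'finished' is an assumption.
CUTOFF = datetime(2017, 1, 1)

(session
 .query(DagRun)
 .filter(DagRun.state == State.RUNNING)
 .filter(DagRun.execution_date < CUTOFF)
 .update({DagRun.state: State.SUCCESS}, synchronize_session=False))
session.commit()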