Also, seeing this in EVERY task that runs:

[2017-01-23 20:26:13,777] {jobs.py:2112} WARNING - State of this instance has been externally set to queued. Taking the poison pill. So long.
[2017-01-23 20:26:13,841] {jobs.py:2051} INFO - Task exited with return code 0

All successful tasks are showing this at the end of their logs. Is this normal?
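For context, the warning comes from the heartbeat check in jobs.py: the running task job periodically re-reads its own state from the database and kills its process if something else has changed that state out from under it. A paraphrased sketch of that check (illustrative names and structure, not the actual 1.8 source):

    # Paraphrased sketch of the LocalTaskJob heartbeat check that emits the
    # "poison pill" warning; names are illustrative, not the 1.8 source.
    from airflow.utils.state import State

    class LocalTaskJobSketch:
        """Illustrative stand-in for the local task job's heartbeat logic."""

        def heartbeat_callback(self):
            # Re-read the task instance's state from the database.
            self.task_instance.refresh_from_db()
            if self.task_instance.state != State.RUNNING:
                # The scheduler, UI, or CLI changed the state externally
                # (here: back to "queued"), so the job terminates itself.
                self.logger.warning(
                    "State of this instance has been externally set to %s. "
                    "Taking the poison pill. So long.",
                    self.task_instance.state)
                self.terminating = True

In other words, the WARNING itself is just the job noticing that its state row changed after the task finished; whether that change is expected depends on what set the state back to "queued".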
On Mon, Jan 23, 2017 at 12:27 PM, Chris Riccomini <criccom...@apache.org> wrote:

> Hey all,
>
> I've upgraded on production. Things seem to be working so far (it's only
> been an hour), but I am seeing this in the scheduler logs:
>
> File Path                            PID    Runtime  Last Runtime  Last Run
> -----------------------------------  -----  -------  ------------  -------------------
> ...
> /etc/airflow/dags/dags/elt/el/db.py  24793  43.41s   986.63s       2017-01-23T20:04:09
> ...
>
> It seems to be taking more than 15 minutes to parse this DAG. Any idea
> what's causing this? Scheduler config:
>
> [scheduler]
> job_heartbeat_sec = 5
> scheduler_heartbeat_sec = 5
> max_threads = 2
> child_process_log_directory = /var/log/airflow/scheduler
>
> The db.py file itself doesn't interact with any outside systems, so I
> would have expected it not to take this long. It does, however,
> programmatically generate many DAGs within the single .py file.
>
> A snippet of the scheduler log is here:
>
> https://gist.github.com/criccomini/a2b2762763c8ba65fefcdd669e8ffd65
>
> Note how there are occasional 10-15 second gaps. Any idea what's going on?
>
> Cheers,
> Chris
>
> On Sun, Jan 22, 2017 at 4:42 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>
>> I created:
>>
>> - AIRFLOW-791: At start-up all running dag_runs are being checked, but
>> not fixed
>> - AIRFLOW-790: DagRuns do not exist for certain tasks, but don't get fixed
>> - AIRFLOW-788: Context unexpectedly added to hive conf
>> - AIRFLOW-792: Allow fixing of schedule when wrong start_date / interval
>> was specified
>>
>> I created AIRFLOW-789 to update UPDATING.md; it is also out as a PR.
>>
>> Please note that I don't consider any of these blockers for a release of
>> 1.8.0; they can be fixed in 1.8.1, so we are still on track for an RC on
>> Feb 2. However, people who are using a restarting scheduler (run_duration
>> is set) and have a lot of running dag runs won't like AIRFLOW-791. A
>> workaround for this would be nice (we just updated dag_runs directly in
>> the database to mark them 'finished' before a certain date, but we are
>> also not using the run_duration).
>>
>> Bolke
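For anyone who needs the AIRFLOW-791 workaround Bolke describes, a minimal sketch against the 1.8 models; marking the runs as "success" stands in for "finished" here, and the cutoff date is a placeholder. Back up the database first.

    # Minimal sketch of the workaround above: mark stale "running" dag_runs
    # as done so the scheduler stops re-checking them at every start-up.
    # The cutoff date is a placeholder; back up the database before running.
    from datetime import datetime

    from airflow import settings
    from airflow.models import DagRun
    from airflow.utils.state import State

    session = settings.Session()
    cutoff = datetime(2017, 1, 1)  # your own "before a certain date"

    session.query(DagRun) \
        .filter(DagRun.state == State.RUNNING,
                DagRun.execution_date < cutoff) \
        .update({DagRun.state: State.SUCCESS}, synchronize_session=False)
    session.commit()

The equivalent raw SQL is a single UPDATE of the state column on the dag_run table, which is what "updated dag_runs directly in the database" amounts to.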
>> > On 20 Jan 2017, at 23:55, Bolke de Bruin <bdbr...@gmail.com> wrote:
>> >
>> > Will do. And thanks.
>> >
>> > Adding another issue:
>> >
>> > * Some of our DAGs are not getting scheduled for some unknown reason.
>> > Need to investigate why.
>> >
>> > Related, but not the root cause:
>> > * Logging is so chatty that it gets really hard to find the real issue.
>> >
>> > Bolke.
>> >
>> >> On 20 Jan 2017, at 23:45, Dan Davydov <dan.davy...@airbnb.com.INVALID> wrote:
>> >>
>> >> I'd be happy to lend a hand fixing these issues, and hopefully some
>> >> others are too. Do you mind creating JIRAs for these, since you have the
>> >> full context? I have created a JIRA for (1) and have assigned it to
>> >> myself:
>> >> https://issues.apache.org/jira/browse/AIRFLOW-780
>> >>
>> >> On Fri, Jan 20, 2017 at 1:01 AM, Bolke de Bruin <bdbr...@gmail.com> wrote:
>> >>
>> >>> This is to report back on some of the (early) experiences we have with
>> >>> Airflow 1.8.0 (beta 1 at the moment):
>> >>>
>> >>> 1. The UI does not show a faulty DAG, leading to confusion for
>> >>> developers. When a faulty DAG was placed in the dags folder, the UI
>> >>> used to report a parsing error. Now it doesn't, because parsing happens
>> >>> in a separate process that does not report errors back.
>> >>>
>> >>> 2. The hive hook sets 'airflow.ctx.dag_id' in hive. We run in a secure
>> >>> environment, which requires this variable to be whitelisted if it is
>> >>> modified (needs to be added to UPDATING.md; a sample whitelist entry is
>> >>> sketched below).
>> >>>
>> >>> 3. DagRuns do not exist for certain tasks, but don't get fixed. The log
>> >>> gets flooded without any suggestion of what to do.
>> >>>
>> >>> 4. At start-up all running dag_runs are being checked; we seemed to
>> >>> have a lot of "left over" dag_runs (a couple of thousand).
>> >>> - Checking was logged at INFO -> requires an fsync for every log
>> >>> message, making it very slow
>> >>> - Checking would happen at every restart, but dag_runs' states were not
>> >>> being updated
>> >>> - These dag_runs would never be marked anything other than running, for
>> >>> some reason
>> >>> -> Applied a workaround to update all dag_runs before a certain date to
>> >>> finished, directly in SQL
>> >>> -> Need to investigate why dag_runs did not get marked "finished/failed"
>> >>>
>> >>> 5. Our umask is set to 027
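Regarding the whitelisting in (2): on Hive clusters with SQL-standard-based authorization enabled, any session variable a client modifies has to match the configured whitelist. A sketch of a hive-site.xml entry that would allow all of the airflow.ctx.* keys; the property name is the standard Hive setting, but the regex value is an assumption to verify against your Hive version:

    <property>
      <name>hive.security.authorization.sqlstd.confwhitelist.append</name>
      <!-- Assumption: permit every variable the hive hook sets under
           the airflow.ctx. prefix, e.g. airflow.ctx.dag_id -->
      <value>airflow\.ctx\..*</value>
    </property>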