Hi Jarek,

Friendly bump on this thread. What are your thoughts on having a scheduler perf
test before each release and incorporating this metric?

Also, is there a devops git repo where these files/scripts could live?

Thanks,

Ping


On Wed, Jul 13, 2022 at 9:47 AM Ping Zhang <pin...@umich.edu> wrote:

> Hi Jarek,
>
> Yep, it is more useful in the stress test stage before releasing a new
> version, with some extra setup, to ensure there is no scheduler performance
> degradation due to a release. It can also help find the scaling limits of
> the scheduler under a given SLA, such as the upper limit on the number of
> tasks in a dag, the total number of dag files in a cluster, the number of
> concurrently running dag runs, etc.
>
> Very good point about synthetic dag files in the stress test. Our team is
> working on a stress test framework that can directly use all production dag
> files, so the stress test exercises the same set of prod dags, but skips
> the task execution. It can also generate different kinds of dags (varying
> the number of tasks, the number of levels, etc.).
>
> Monitoring production issues for particular DAGs or times of day is a
> different problem. I agree that in prod we should not make the scheduler
> calculate the `dependency met` time.
>
>
> Thanks,
>
> Ping
>
>
> On Tue, Jul 12, 2022 at 11:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> I think if we limit it to stress tests, this could be an "extra"
>> addition - not even necessarily part of the Airflow codebase: a script
>> that adds the triggers on a single database, some kind of test harness
>> that you always apply after you have installed Airflow in the test
>> environment. For that I have far fewer reservations about using triggers.
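>>
>> For illustration, a minimal sketch of such a post-install harness step
>> (Postgres-only; the audit table, trigger and function names are
>> hypothetical, and the task_instance columns are assumed from the
>> Airflow 2.x schema):
>>
>> from sqlalchemy import create_engine, text
>>
>> DDL = """
>> CREATE TABLE IF NOT EXISTS task_instance_audit (
>>     id BIGSERIAL PRIMARY KEY,
>>     dag_id TEXT, task_id TEXT, run_id TEXT, state TEXT,
>>     recorded_at TIMESTAMPTZ DEFAULT now()
>> );
>>
>> CREATE OR REPLACE FUNCTION record_ti_state_change() RETURNS trigger AS $$
>> BEGIN
>>     INSERT INTO task_instance_audit (dag_id, task_id, run_id, state)
>>     VALUES (NEW.dag_id, NEW.task_id, NEW.run_id, NEW.state);
>>     RETURN NEW;
>> END;
>> $$ LANGUAGE plpgsql;
>>
>> CREATE TRIGGER ti_state_change_audit
>> AFTER UPDATE OF state ON task_instance
>> FOR EACH ROW EXECUTE FUNCTION record_ti_state_change();
>> """
>>
>> # Run once against the test database, after Airflow is installed.
>> engine = create_engine("postgresql://airflow:airflow@localhost/airflow")
>> with engine.begin() as conn:
>>     conn.execute(text(DDL))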
>>
>> But if we want to measure the delays in production, that's quite a
>> different story (and different purpose):
>>
>> * The stress tests are synthetic, and basically what you get out of
>> them is: "is this version worse/better than the previous one?", "by how
>> much?", "which synthetic scenarios are affected most?". Those will be
>> done with a few synthetic kinds of traffic/load/shape.
>> * Production is different - there you really want to see whether you
>> have problems with particular DAGs, times of the day or week, load, etc.,
>> and you should be able to take corrective actions (for example
>> increase the number of schedulers or queues, split your dags, etc.) - so
>> even if the "scheduling delay" metric might sound familiar, you might
>> want to use completely different dimensions to look at it (what about
>> this DAG? this time of day? this group of dags? this type of workload?).
>>
>> I think those two might even be separated and calculated differently
>> (though having a single approach would likely be better). I am not
>> entirely sure, but I have a feeling we do not need the scheduler to
>> calculate the "dependency met" time while scheduling. I think for
>> production purposes, it would be much better (less overhead) to simply
>> emit "raw" metrics such as the start/end time of each task, plus
>> possibly simple publishing of the - mostly static - task dependency
>> rules; the "dependency met" time can then be calculated offline from
>> the joined data. That would be roughly equivalent to what you have in
>> the trigger, but without the overhead of triggers - instead of storing
>> the events in the metadata db, we would emit them (for example using
>> OTEL) and let an external system aggregate and process them offline,
>> independently.
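>>
>> As a rough sketch (not an agreed design) of what emitting such raw
>> events could look like with the OpenTelemetry Python SDK - the span
>> name, attributes and exporter choice here are all illustrative:
>>
>> from opentelemetry import trace
>> from opentelemetry.sdk.trace import TracerProvider
>> from opentelemetry.sdk.trace.export import (
>>     BatchSpanProcessor,
>>     ConsoleSpanExporter,
>> )
>>
>> # BatchSpanProcessor buffers in memory and pushes asynchronously -
>> # the "lightweight" property described below.
>> provider = TracerProvider()
>> provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
>> trace.set_tracer_provider(provider)
>> tracer = trace.get_tracer("airflow.task.events")
>>
>> def run_task(dag_id: str, task_id: str, run_id: str) -> None:
>>     # One span per task run: the span's start/end timestamps are the
>>     # "raw" data, the attributes are the join keys for offline analysis.
>>     with tracer.start_as_current_span(
>>         "task_run",
>>         attributes={"dag_id": dag_id, "task_id": task_id, "run_id": run_id},
>>     ):
>>         ...  # actual task execution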
>>
>> The OTEL integrations are rather lightweight - most of them use
>> in-memory buffers and push the data efficiently (and can even
>> implement scalable forwarding of the data and pre-aggregation). The
>> nice thing about it is that it scales much more easily. I think that
>> (apart from my earlier reservations) the database-trigger approach has
>> the not-nice property that the more workers and schedulers you have,
>> the more "centralized overhead" you accumulate, whereas the distributed
>> OTEL solution scales together with the system, adding a more or less
>> fixed overhead per component (provided that the remote telemetry
>> service is also scalable). This makes the trigger approach far less
>> suitable IMHO, as we get dangerously close to Heisen-Monitoring, where
>> the more we observe the system, the more we impact its performance.
>>
>> J.
>>
>> On Tue, Jul 12, 2022 at 6:49 PM Ping Zhang <pin...@umich.edu> wrote:
>> >
>> > Hi Jarek,
>> >
>> > Thanks for the insights and for pointing out the potential issues with
>> triggers in prod with a scheduler HA setup.
>> >
>> > The solution that I proposed is mainly for stress testing the scheduler
>> before each airflow release. We could make changes in the airflow codebase
>> to emit this metric, however:
>> >
>> > 1. It would incur additional overhead, as the scheduler would need to
>> compute the dependency-met time of each task.
>> > 2. It couples the metric with the implementation of the scheduler. For
>> example, from 1.10.4 to airflow 2, the scheduler changed a lot. If the
>> metric is emitted from the scheduler, any change to the scheduler also
>> requires updating how the metric is computed and emitted.
>> >
>> > Thus, I think keeping it out of the airflow core makes it easier to
>> compare the scheduling delay across different airflow versions.
>> >
>> > Thanks for pointing out OpenTelemetry, let me check it out.
>> >
>> > Thanks,
>> >
>> > Ping
>> >
>> >
>> > On Mon, Jul 11, 2022 at 9:44 AM Jarek Potiuk <pot...@apache.org> wrote:
>> >>
>> >> Sorry for the late reply - Ping.
>> >>
>> >> TL;DR: I think the metrics might be useful, but using triggers
>> >> is asking for trouble.
>> >>
>> >> While using triggers sounds like a common approach in a number of
>> >> installations, we have not used triggers so far.
>> >> Using triggers moves some logic into the database, and in our case
>> >> there is none there at all - all logic is in Airflow, and we keep it
>> >> there; the database for us is merely "state" storage and "locks".
>> >> Adding database triggers extends it to also keep some logic there.
>> >> And triggers come with a worrying "implicitness" which goes against
>> >> the "Explicit is better than Implicit" Zen of Python.
>> >>
>> >> One thing that makes me think "coldly" about this is that it might
>> >> have some undesired side effects - such as serializing changes
>> >> from multiple schedulers trying to insert such audit entries (you
>> >> need to take an index lock when you insert rows into a table that
>> >> has a primary key/unique indexes).
>> >>
>> >> And what's even more worrying is that we are using SQLAlchemy with
>> >> MySQL/MSSQL/Postgres, and we would have to make sure it works the
>> >> same in all of them. This is troublesome.
>> >>
>> >> Even if we could solve and verify all those problems individually,
>> >> the effect is: once we open the "gate" of triggers, we will get more
>> >> "ok, we have a trigger here, so let's also use it for that and this",
>> >> and it will be hard to say "no" once we have a precedent. This
>> >> might lead to more and more logic and features being deferred to
>> >> database logic (and my past experience is that this leads to more
>> >> complexity and implicit behaviours that are difficult to reason about).
>> >>
>> >> But this is only about the technical details, not the metric
>> >> itself. I think the metric you proposed is very useful.
>> >>
>> >> I think, however (correct me if I am wrong), that we do not need
>> >> database triggers for any of those. I have a feeling that this
>> >> proposal is trying to implement the (useful) metrics with very limited
>> >> modification to the Airflow code, so I can understand that you might
>> >> think about it this way when you have your own fork - then it makes
>> >> sense to piggyback on the existing database and use triggers, because
>> >> you do not want to modify Airflow code.
>> >>
>> >> But here we are in a completely different situation. We CAN modify
>> >> Airflow code and add the missing features and functionality to
>> >> capture the necessary metric data in the code, rather than using
>> >> triggers. We could even define some kind of callbacks for the
>> >> auditing events that would allow us to gather those metrics without
>> >> even using the database to store the information for the metrics.
>> >>
>> >> In fact - this leads me to the conclusion that we should implement
>> >> the metrics you mention as part of our Open-Telemetry effort:
>> >> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow
>> >> This is precisely what it was prepared for. Once we have
>> >> Open-Telemetry integrated, we could add more and more such useful
>> >> metrics more easily, and that could be far more useful: because
>> >> instead of running an external custom db-reading process for this, we
>> >> could not only calculate such metrics using the right metrics tooling
>> >> (each company could use their preferred open-telemetry compliant
>> >> tool), but that would also open up features like alerting,
>> >> connecting it with traces and other metrics, etc.
>> >>
>> >> Howard - WDYT?
>> >>
>> >> J.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka
>> >> <vik...@astronomer.io.invalid> wrote:
>> >> >
>> >> > Hi Ping,
>> >> >
>> >> > Apologies for the belated response.
>> >> >
>> >> > We have created a set of stress test DAGs where the tasks take
>> almost no time to execute, so that the worker task execution time is
>> small and the stress is on the Scheduler and Executor.
>> >> >
>> >> > We then calculate "task latency" aka "task lag" as:
>> >> >  ti_lag = ti.start_date - max_upstream_ti_end_date
>> >> > This is effectively the time between "the last dependent upstream
>> task completing" and "the downstream task starting".
>> >> >
>> >> > Tasks that don't have any upstream tasks are excluded from this
>> metric when measuring task lag.
>> >> > And for tasks that have multiple upstream tasks, we use the upstream
>> task with the latest end_date, as the scheduler waits for
>> completion of all parent tasks before scheduling any downstream task.
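>> >> >
>> >> > As a rough illustration (not the actual harness), this lag can be
>> >> > computed offline with a query over the metadata db, assuming the
>> >> > dependency edges have been exported to a hypothetical
>> >> > task_dependency(dag_id, task_id, upstream_task_id) table:
>> >> >
>> >> > from sqlalchemy import create_engine, text
>> >> >
>> >> > # The inner joins naturally exclude tasks without upstream tasks,
>> >> > # matching the exclusion described above.
>> >> > QUERY = """
>> >> > SELECT ti.dag_id, ti.task_id, ti.run_id,
>> >> >        ti.start_date - MAX(up.end_date) AS ti_lag
>> >> > FROM task_instance ti
>> >> > JOIN task_dependency dep
>> >> >   ON dep.dag_id = ti.dag_id AND dep.task_id = ti.task_id
>> >> > JOIN task_instance up
>> >> >   ON up.dag_id = ti.dag_id AND up.run_id = ti.run_id
>> >> >  AND up.task_id = dep.upstream_task_id
>> >> > WHERE ti.start_date IS NOT NULL
>> >> > GROUP BY ti.dag_id, ti.task_id, ti.run_id, ti.start_date;
>> >> > """
>> >> >
>> >> > engine = create_engine("postgresql://airflow:airflow@localhost/airflow")
>> >> > with engine.connect() as conn:
>> >> >     for row in conn.execute(text(QUERY)):
>> >> >         print(row.dag_id, row.task_id, row.run_id, row.ti_lag)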
>> >> >
>> >> > Vikram
>> >> >
>> >> >
>> >> > On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <pin...@umich.edu> wrote:
>> >> >>
>> >> >> Hi Mehta,
>> >> >>
>> >> >> Good point. The primary goal of the metric is stress testing, to
>> catch airflow scheduler performance regressions for 1) our internal
>> scheduler improvement work and 2) airflow version upgrades.
>> >> >>
>> >> >> One of the key benefits of this metric definition is that it is
>> independent of the scheduler implementation, and it can be
>> computed/backfilled offline.
>> >> >>
>> >> >> Currently, we expose it to Datadog, and we (the airflow cluster
>> maintainers) are its main users.
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Ping
>> >> >>
>> >> >>
>> >> >> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham
>> <shu...@amazon.com.invalid> wrote:
>> >> >>>
>> >> >>> Ping,
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> I’m very interested in this as well. A good metric can help us
>> benchmark and identify potential improvements in scheduler performance.
>> >> >>> To understand the proposal better, can you please share
>> where and how you intend to use “Scheduling delay”? Is it meant for
>> benchmarking or stress testing only? Do you plan to expose it to users
>> in the Airflow UI?
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Thanks
>> >> >>> Shubham
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> From: Ping Zhang <pin...@umich.edu>
>> >> >>> Reply-To: "dev@airflow.apache.org" <dev@airflow.apache.org>
>> >> >>> Date: Wednesday, June 8, 2022 at 11:58 AM
>> >> >>> To: "dev@airflow.apache.org" <dev@airflow.apache.org>, "
>> vik...@astronomer.io" <vik...@astronomer.io>
>> >> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay Metric
>> Definition
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Hi Vikram,
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Thanks for pointing out 'task latency':
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> "we define task latency as the time it takes for a task to begin
>> executing once its dependencies have been met."
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> It would be great if you could elaborate on "begin executing"
>> and on how you calculate "its dependencies have been met".
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> If 'begin executing' means the state of the ti becoming running,
>> then the 'Scheduling Delay' metric focuses on the overhead introduced by
>> the scheduler.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> In our prod and stress tests, we use the `task_instance_audit`
>> table (a new row is created whenever there is a state change in the
>> task_instance table) to compute the time at which a ti should be
>> scheduled.
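>> >> >>>
>> >> >>> As a sketch of that offline computation (the audit table columns,
>> >> >>> the state values, and the task_dependency table here are
>> >> >>> assumptions for illustration, not our exact schema):
>> >> >>>
>> >> >>> from sqlalchemy import create_engine, text
>> >> >>>
>> >> >>> # dependency_met_at = when the last upstream task reached
>> >> >>> # 'success'; scheduling delay = time from dependency met to the
>> >> >>> # ti's 'running' audit row.
>> >> >>> QUERY = """
>> >> >>> WITH dep_met AS (
>> >> >>>     SELECT a.dag_id, dep.task_id, a.run_id,
>> >> >>>            MAX(a.recorded_at) AS dependency_met_at
>> >> >>>     FROM task_instance_audit a
>> >> >>>     JOIN task_dependency dep
>> >> >>>       ON dep.dag_id = a.dag_id AND dep.upstream_task_id = a.task_id
>> >> >>>     WHERE a.state = 'success'
>> >> >>>     GROUP BY a.dag_id, dep.task_id, a.run_id
>> >> >>> )
>> >> >>> SELECT r.dag_id, r.task_id, r.run_id,
>> >> >>>        r.recorded_at - d.dependency_met_at AS scheduling_delay
>> >> >>> FROM task_instance_audit r
>> >> >>> JOIN dep_met d USING (dag_id, task_id, run_id)
>> >> >>> WHERE r.state = 'running';
>> >> >>> """
>> >> >>>
>> >> >>> engine = create_engine("postgresql://airflow:airflow@localhost/airflow")
>> >> >>> with engine.connect() as conn:
>> >> >>>     rows = conn.execute(text(QUERY)).fetchall()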
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Thanks,
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Ping
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka
>> <vik...@astronomer.io.invalid> wrote:
>> >> >>>
>> >> >>> Ping,
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> I am quite interested in this topic and am trying to understand
>> the difference between the "scheduling delay" metric as articulated here
>> and the "task latency" aka "task lag" metric which we have been using.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> As you may recall, we have been using two specific metrics to
>> benchmark Scheduler performance since Airflow 2.0: "task latency" and
>> "task throughput".
>> >> >>>
>> >> >>> These were described in the 2.0 Scheduler blog post.
>> >> >>> Specifically, within that we defined task latency as the time it
>> takes for the task to begin executing once its dependencies are all met.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Thanks,
>> >> >>>
>> >> >>> Vikram
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <pin...@umich.edu>
>> wrote:
>> >> >>>
>> >> >>> Hi Airflow Community,
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Airflow is a scheduling platform for data pipelines; however,
>> there is no good metric to measure scheduling delay in production or in
>> the stress test environment. This makes it hard to catch regressions
>> in the scheduler during the stress test stage.
>> >> >>>
>> >> >>>
>> >> >>> I would like to propose an airflow scheduling delay metric
>> definition. Here is the detailed design of the metric and its
>> implementation:
>> >> >>>
>> >> >>>
>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing
>> >> >>>
>> >> >>> Please take a look; any feedback is welcome.
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Thanks,
>> >> >>>
>> >> >>>
>> >> >>>
>> >> >>> Ping
>> >> >>>
>> >> >>>
>>
>
