Hi Jarek,

A friendly bump on this thread. What are your thoughts on having a scheduler perf test before each release and incorporating this metric?
Also, is there a devops git repo to put these files and logic in?

Thanks,
Ping

On Wed, Jul 13, 2022 at 9:47 AM Ping Zhang <pin...@umich.edu> wrote:

> Hi Jarek,
>
> Yep, it is more useful in the stress test stage before releasing a new version, with some extra setup to ensure there is no scheduler performance degradation due to a release. This can also help to find the scaling limit of the scheduler under a certain SLA, like the upper limit of the number of tasks in a dag, the total number of dag files in a cluster, the number of concurrently running dag runs, etc.
>
> Very good point about synthetic dag files in the stress test. Our team is working on a stress test framework that can directly use all production dag files, to ensure the stress test has the same set of prod dags, but it will skip the task execution. It can also generate different kinds of dags (varying the number of tasks, levels, etc.).
>
> Monitoring production issues for particular DAGs or times of the day is a different issue. I agree that in prod we should not let the scheduler calculate the `dependency met` time.
>
> Thanks,
>
> Ping
>
> On Tue, Jul 12, 2022 at 11:02 AM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>> I think if we limit it to stress tests, this could be an "extra" addition - not even necessarily part of the Airflow codebase: adding triggers with a script, on a single database, as some kind of test harness that you always add after you have installed Airflow in the test environment - for that I have far fewer reservations about using triggers.
>>
>> But if we want to measure the delays in production, that's quite a different story (and a different purpose):
>>
>> * The stress tests are synthetic, and basically what you get out of them is "is this version better or worse than the previous one?", "by how much?", and "which synthetic scenarios are affected most?". Those will be done with a few synthetic kinds of traffic/load/shape.
>> * Production is different - you really want to see whether you have problems with particular DAGs, times of the day, days of the week, load, etc., and you should be able to take corrective action (for example increase the number of schedulers or queues, split your dags, etc.) - so even if the "scheduling delay" metric sounds familiar, you might want to use completely different dimensions to look at it (what about this DAG? this time of day? this group of dags? this type of workload?).
>>
>> I think those two might even be separated and calculated differently (though having a single approach would likely be better). I am not entirely sure, but I have a feeling we do not need the scheduler to calculate the "dependency met" time while scheduling. I think for production purposes it would be much better (less overhead) to simply emit "raw" metrics such as the start/end time of each task, plus possibly a simple publishing of the - mostly static - task dependency rules; the "dependency met" time can then be calculated offline based on the joined data. That would be roughly equivalent to what you have in the trigger, but without the overhead of triggers - instead of storing the events in the metadata db we would simply emit them (for example using otel) and let the external system aggregate them and process them offline independently.
>>
>> The OTEL integration is rather lightweight - most implementations use in-memory buffers and push the data efficiently (and can even implement scalable forwarding of the data and pre-aggregation).
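
To make the "raw events" idea concrete: a minimal sketch, assuming the OpenTelemetry Python SDK, of emitting one span per task-instance run so that its start/end timestamps can later be joined offline with the (mostly static) dependency rules. The tracer, span, and attribute names below are hypothetical, not an existing Airflow or AIP-49 API.

    # Hypothetical sketch: emit raw task start/end events as OpenTelemetry spans.
    # Aggregation (e.g. computing "dependency met" time) happens offline in the
    # external telemetry backend, not in the scheduler.
    from datetime import datetime, timezone

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    # BatchSpanProcessor buffers in memory and pushes periodically, so the
    # per-component overhead stays roughly constant as the cluster grows.
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("airflow.scheduling_delay.sketch")  # hypothetical name


    def _ns(dt: datetime) -> int:
        """Convert a datetime to the epoch-nanosecond timestamps OTEL expects."""
        return int(dt.timestamp() * 1_000_000_000)


    def emit_task_run(dag_id: str, task_id: str, run_id: str,
                      start: datetime, end: datetime) -> None:
        """Publish one task-instance run as a span with explicit start/end times."""
        span = tracer.start_span(
            "task_instance_run",  # hypothetical span name
            start_time=_ns(start),
            attributes={"dag_id": dag_id, "task_id": task_id, "run_id": run_id},
        )
        span.end(end_time=_ns(end))


    if __name__ == "__main__":
        now = datetime.now(timezone.utc)
        emit_task_run("example_dag", "transform", "manual__2022-07-13", now, now)
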
>> The nice thing about it is that it can scale much more easily. I think that (apart from my earlier reservations) the database-trigger approach has the not-nice property that its overhead is "centralized" in the database and grows as the system grows, whereas the distributed OTEL solution scales together with the system, adding a more or less fixed overhead per component (provided that the remote telemetry service is also scalable). This makes the trigger approach far less suitable IMHO, as we are getting dangerously close to Heisen-Monitoring, where the more we observe the system, the more we impact its performance.
>>
>> J.
>>
>> On Tue, Jul 12, 2022 at 6:49 PM Ping Zhang <pin...@umich.edu> wrote:
>> >
>> > Hi Jarek,
>> >
>> > Thanks for the insights and for pointing out the potential issues with triggers in prod with a scheduler HA setup.
>> >
>> > The solution that I proposed is mainly for stress-testing the scheduler before each airflow release. We can make changes in the airflow codebase to emit this metric, however:
>> >
>> > 1. It will incur additional overhead for the scheduler to compute the metric, as the scheduler needs to compute the dependency met time of a task.
>> > 2. It couples with the implementation of the scheduler. For example, from 1.10.4 to airflow 2, the scheduler has changed a lot. If the metric is emitted from the scheduler, then when making changes in the scheduler we also need to update how the metric is computed and emitted.
>> >
>> > Thus, I think having it out of the airflow core makes it easier to compare the scheduling delay across different airflow versions.
>> >
>> > Thanks for pointing out OpenTelemetry, let me check it out.
>> >
>> > Thanks,
>> >
>> > Ping
>> >
>> > On Mon, Jul 11, 2022 at 9:44 AM Jarek Potiuk <pot...@apache.org> wrote:
>> >>
>> >> Sorry for the late reply - Ping.
>> >>
>> >> TL;DR: I think the metrics might be useful, but I think using triggers is asking for trouble.
>> >>
>> >> While using triggers sounds like a common approach in a number of installations, we do not use triggers so far. Using triggers moves some logic to the database, and in our case we do not have any there at all - all logic is in Airflow, and we keep it there; the database for us is merely "state" storage and "locks". Adding database triggers extends it to also keep some logic there. And adding triggers has some worrying "implicitness" which goes against the "Explicit is better than Implicit" Zen of Python.
>> >>
>> >> One thing that makes me think "coldly" about this is that it might have some undesired side effects - such as synchronization between multiple schedulers when they try to insert such an audit entry (you need to acquire an index lock when you insert rows into a table which has primary key/unique indexes).
>> >>
>> >> And what's even more worrying is that we are using SQLAlchemy and MySQL/MSSQL/Postgres, and we should make sure it works the same in all of them. This is troublesome.
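
For context, a rough sketch of the kind of audit trigger being discussed, in Postgres flavor only. The table, column, and trigger names are assumptions, not Ping's actual implementation, and (as noted above) MySQL and MSSQL would need their own variants, which is part of the concern.

    # Hypothetical sketch: a Postgres trigger that writes one task_instance_audit
    # row per task_instance state change. Names and schema are assumptions; this
    # is not part of Airflow, and it is a test-harness-only idea.
    from sqlalchemy import create_engine

    AUDIT_DDL = """
    CREATE TABLE IF NOT EXISTS task_instance_audit (
        id          BIGSERIAL PRIMARY KEY,
        dag_id      TEXT NOT NULL,
        task_id     TEXT NOT NULL,
        run_id      TEXT NOT NULL,
        state       TEXT,
        changed_at  TIMESTAMPTZ NOT NULL DEFAULT now()
    );

    CREATE OR REPLACE FUNCTION record_ti_state_change() RETURNS trigger AS $$
    BEGIN
        IF NEW.state IS DISTINCT FROM OLD.state THEN
            INSERT INTO task_instance_audit (dag_id, task_id, run_id, state)
            VALUES (NEW.dag_id, NEW.task_id, NEW.run_id, NEW.state);
        END IF;
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER ti_state_change_audit
    AFTER UPDATE OF state ON task_instance
    FOR EACH ROW EXECUTE FUNCTION record_ti_state_change();
    """


    def install_audit_trigger(dsn: str) -> None:
        """Apply the audit DDL after Airflow is installed in a test environment."""
        engine = create_engine(dsn)
        with engine.begin() as conn:
            conn.exec_driver_sql(AUDIT_DDL)
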
>> >>
>> >> Even if we could solve and verify all those problems individually, the effect is that once we open the "gate" of triggers, we will get more "ok, we have a trigger here, so let's also use it for that and this", and it will be hard to say "no" once we already have a precedent; this might lead to more and more logic and features being deferred to database logic (and my past experience is that it leads to more complexity and implicit behaviours that are difficult to reason about).
>> >>
>> >> But this is only about the technical details, not the metric itself. I think the metric you proposed is very useful.
>> >>
>> >> I think however (correct me if I am wrong) that we do not need database triggers for any of those. I have a feeling that this proposal is trying to implement the (useful) metrics with very limited modification to the Airflow code, so I can understand that you might think about it this way when you have your own fork - then it makes sense to piggyback on the existing database and use triggers, because you do not want to modify Airflow code.
>> >>
>> >> But here we are in a completely different situation. We CAN modify Airflow code and add missing features and functionality to capture the necessary metric data in the code, rather than using triggers. We could even define some kind of callbacks for the auditing events that would allow us to gather those metrics in a way that does not even use the database to store the information for the metrics.
>> >>
>> >> In fact, this leads me to the conclusion that we should implement the metrics you mention as part of our Open-Telemetry effort: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-49+OpenTelemetry+Support+for+Apache+Airflow
>> >> This is precisely what it was prepared for. Once we have Open-Telemetry integrated, we could add more and more such useful metrics more easily, and that could be far more useful: instead of running an external custom DB-reading process for this, we could not only calculate such metrics using the right metrics tooling (each company could use their preferred open-telemetry-compliant tool), but it would also open up features like alerting, connecting it with traces and other metrics, etc.
>> >>
>> >> Howard - WDYT?
>> >>
>> >> J.
>> >>
>> >> On Thu, Jun 30, 2022 at 4:52 PM Vikram Koka <vik...@astronomer.io.invalid> wrote:
>> >> >
>> >> > Hi Ping,
>> >> >
>> >> > Apologies for the belated response.
>> >> >
>> >> > We have created a set of stress test DAGs where the tasks take almost no time to execute at all, so that the worker task execution time is small and the stress is on the Scheduler and Executor.
>> >> >
>> >> > We then calculate "task latency" aka "task lag" as:
>> >> > ti_lag = ti.start_date - max_upstream_ti_end_date
>> >> > This is effectively the time between "the downstream task starting" and "the last dependent upstream task completing".
>> >> >
>> >> > Tasks that don't have any upstream tasks are not used in this metric for measuring task lag. And for tasks that have multiple upstream tasks, we use the upstream task with the latest end_date, since the scheduler waits for the completion of all parent tasks before scheduling any downstream task.
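
A minimal sketch of that calculation, with illustrative data structures; the function and argument names are not an existing Airflow API, and real data would come from the task_instance table (or emitted events).

    # Sketch of: ti_lag = ti.start_date - max_upstream_ti_end_date
    from datetime import datetime, timedelta
    from typing import Dict, Optional, Sequence


    def task_lag(
        start_date: datetime,                     # start_date of the downstream task instance
        upstream_task_ids: Sequence[str],         # static upstream dependencies of that task
        upstream_end_dates: Dict[str, datetime],  # task_id -> end_date within the same dag run
    ) -> Optional[timedelta]:
        if not upstream_task_ids:
            return None  # tasks without upstream tasks are excluded from the metric
        # The scheduler waits for *all* parents, so the binding upstream is the
        # one with the latest end_date.
        max_upstream_end = max(upstream_end_dates[t] for t in upstream_task_ids)
        return start_date - max_upstream_end


    if __name__ == "__main__":
        ends = {
            "extract_a": datetime(2022, 6, 30, 12, 0, 5),
            "extract_b": datetime(2022, 6, 30, 12, 0, 9),
        }
        lag = task_lag(datetime(2022, 6, 30, 12, 0, 12), ["extract_a", "extract_b"], ends)
        print(lag)  # 0:00:03 - measured against the later upstream, extract_b
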
>> >> >
>> >> > Vikram
>> >> >
>> >> > On Wed, Jun 8, 2022 at 2:58 PM Ping Zhang <pin...@umich.edu> wrote:
>> >> >>
>> >> >> Hi Mehta,
>> >> >>
>> >> >> Good point. The primary goal of the metric is stress testing, to catch airflow scheduler performance regressions from 1) our internal scheduler improvement work and 2) airflow version upgrades.
>> >> >>
>> >> >> One of the key benefits of this metric definition is that it is independent of the scheduler implementation, and it can be computed/backfilled offline.
>> >> >>
>> >> >> Currently, we expose it to Datadog, and we (the airflow cluster maintainers) are its main users.
>> >> >>
>> >> >> Thanks,
>> >> >>
>> >> >> Ping
>> >> >>
>> >> >> On Wed, Jun 8, 2022 at 2:36 PM Mehta, Shubham <shu...@amazon.com.invalid> wrote:
>> >> >>>
>> >> >>> Ping,
>> >> >>>
>> >> >>> I'm very interested in this as well. A good metric can help us benchmark and identify potential improvements in scheduler performance. In order to understand the proposal better, can you please share where and how you intend to use "Scheduling delay"? Is it meant for benchmarking or stress testing only? Do you plan to expose it to users in the Airflow UI?
>> >> >>>
>> >> >>> Thanks
>> >> >>> Shubham
>> >> >>>
>> >> >>> From: Ping Zhang <pin...@umich.edu>
>> >> >>> Reply-To: "dev@airflow.apache.org" <dev@airflow.apache.org>
>> >> >>> Date: Wednesday, June 8, 2022 at 11:58 AM
>> >> >>> To: "dev@airflow.apache.org" <dev@airflow.apache.org>, "vik...@astronomer.io" <vik...@astronomer.io>
>> >> >>> Subject: RE: [EXTERNAL][DISCUSS] Airflow Scheduling Delay Metric Definition
>> >> >>>
>> >> >>> Hi Vikram,
>> >> >>>
>> >> >>> Thanks for pointing that out - 'task latency':
>> >> >>>
>> >> >>> "we define task latency as the time it takes for a task to begin executing once its dependencies have been met."
>> >> >>>
>> >> >>> It would be great if you could elaborate on "begin executing" and on how you calculate "its dependencies have been met".
>> >> >>>
>> >> >>> If 'begin executing' means the state of the ti becoming running, then the 'Scheduling Delay' metric focuses on the overhead introduced by the scheduler.
>> >> >>>
>> >> >>> In our prod and stress tests, we use the `task_instance_audit` table (a new row is created whenever there is a state change in the task_instance table) to compute the time at which a ti should be scheduled.
>> >> >>>
>> >> >>> Thanks,
>> >> >>>
>> >> >>> Ping
>> >> >>>
>> >> >>> On Wed, Jun 8, 2022 at 11:25 AM Vikram Koka <vik...@astronomer.io.invalid> wrote:
>> >> >>>
>> >> >>> Ping,
>> >> >>>
>> >> >>> I am quite interested in this topic and am trying to understand the difference between the "scheduling delay" metric as articulated here and the "task latency" aka "task lag" metric which we have been using before.
>> >> >>>
>> >> >>> As you may recall, we have been using two specific metrics to benchmark Scheduler performance since Airflow 2.0, specifically "task latency" and "task throughput". These were described in the 2.0 Scheduler blog post. Specifically, within that post we defined task latency as the time it takes for a task to begin executing once its dependencies are all met.
>> >> >>>
>> >> >>> Thanks,
>> >> >>> Vikram
>> >> >>>
>> >> >>> On Wed, Jun 8, 2022 at 10:25 AM Ping Zhang <pin...@umich.edu> wrote:
>> >> >>>
>> >> >>> Hi Airflow Community,
>> >> >>>
>> >> >>> Airflow is a scheduling platform for data pipelines; however, there is no good metric to measure scheduling delay in production or in the stress test environment. This makes it hard to catch regressions in the scheduler during the stress test stage.
>> >> >>>
>> >> >>> I would like to propose an airflow scheduling delay metric definition. Here is the detailed design of the metric and its implementation:
>> >> >>> https://docs.google.com/document/d/1NhO26kgWkIZJEe50M60yh_jgROaU84dRJ5qGFqbkNbU/edit?usp=sharing
>> >> >>>
>> >> >>> Please take a look; any feedback is welcome.
>> >> >>>
>> >> >>> Thanks,
>> >> >>>
>> >> >>> Ping
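
For readers who do not open the design doc: a rough sketch of how a scheduling delay along these lines could be computed offline from task_instance_audit-style state-change rows. The schema, state names, and the exact delay definition below are assumptions inferred from this thread, not the proposal itself.

    # Hypothetical sketch: derive a per-task "scheduling delay" offline from
    # state-change rows like those in the task_instance_audit table mentioned
    # above (one row per task_instance state change).
    from dataclasses import dataclass
    from datetime import datetime, timedelta
    from typing import Dict, List, Optional, Sequence


    @dataclass
    class StateChange:
        dag_id: str
        run_id: str
        task_id: str
        state: str          # e.g. "scheduled", "queued", "running", "success"
        changed_at: datetime


    def first_time_in_state(events: Sequence[StateChange],
                            task_id: str, state: str) -> Optional[datetime]:
        times = [e.changed_at for e in events if e.task_id == task_id and e.state == state]
        return min(times) if times else None


    def scheduling_delay(
        events: Sequence[StateChange],
        upstream_map: Dict[str, List[str]],  # static task dependencies of the dag
        task_id: str,
    ) -> Optional[timedelta]:
        """Time between all upstream tasks finishing ("dependency met") and the
        downstream task instance starting to run."""
        upstreams = upstream_map.get(task_id, [])
        if not upstreams:
            return None  # no upstream tasks -> excluded, as in the task-lag metric
        upstream_done = [first_time_in_state(events, u, "success") for u in upstreams]
        started = first_time_in_state(events, task_id, "running")
        if started is None or any(t is None for t in upstream_done):
            return None  # not enough data yet
        return started - max(upstream_done)

Because the inputs are just state-change rows plus the static dependency map, this can be computed or backfilled entirely outside the scheduler, which is the property the thread emphasizes.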