This is obviously a subject dear to my heart :)

As I mentioned in that article, I think it's critical, both for performance 
testing of changes and for "regression" testing of releases, that we have a 
repeatable performance test framework. I somewhat simplified the script I used 
in the article, so I need to tidy it up a bit more and work out where to put it.

I've created an epic for "scheduler-performance", mostly for my own tracking 
https://issues.apache.org/jira/browse/AIRFLOW-5929

Better metrics are definitely useful for the "operability" of an Airflow 
cluster, but we usually lag behind in having enough of them, because the way it 
usually works is that metrics get added _after_ someone goes "gah, something 
funny is going on here, if only we had metric X I would have seen it" :) So some 
concerted work to add more of them up front would be great.

I would add a third class of performance work, which is benchmarking specific 
bits of code -- it's not quite instrumentation, as it's not left in the run-time 
code, but it is useful to be able to benchmark specific parts of the code under 
repeatable circumstances (particularly relevant for the scheduler, where the 
structure of the dag, the state of other dags, and the number of active runs 
all change the performance of the code), so there is a lot of 
"boiler-plate"/setup code to get the DB into the right state.
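
To give an idea of the shape of that boiler-plate, here is a rough sketch -- 
the helpers reset_db and seed_dag_runs are placeholders for code that would use 
the Airflow ORM to put the metadata DB into a known state, they are not 
existing Airflow APIs:

    import timeit
    from contextlib import contextmanager

    def reset_db():
        """Hypothetical helper: wipe the metadata DB back to a clean slate."""
        ...

    def seed_dag_runs(n_dags=50, runs_per_dag=3, tasks_per_dag=20):
        """Hypothetical helper: create synthetic dags, dag runs and task instances."""
        ...

    @contextmanager
    def known_db_state(**db_shape):
        reset_db()
        seed_dag_runs(**db_shape)
        yield

    def benchmark(fn, repeat=5, **db_shape):
        timings = []
        for _ in range(repeat):
            # Re-create exactly the same DB state before every run so the
            # numbers are comparable between code changes.
            with known_db_state(**db_shape):
                timings.append(timeit.timeit(fn, number=1))
        return min(timings), sum(timings) / len(timings)

The point is that the setup is the bulk of the work -- the thing being timed is 
usually a one-liner.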

One tool I've found very useful for capturing and visualising statsd metrics 
locally is called "netdata" - it's like prometheus, statsd and grafana all 
rolled into one, and is amazing for being a single binary that will capture 
metrics and display dashboards. I'm working on porting our Airflow Grafana 
dashboards into the netdata format for local development. For example here 
is one of their live demos: https://london.my-netdata.io/default.html
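
For anyone who wants to try that locally, the Airflow side is just the usual 
statsd settings (with the statsd python package installed) pointed at netdata's 
built-in statsd listener, which by default listens on UDP port 8125 -- roughly 
this in airflow.cfg on the 1.10 series, where these options live under 
[scheduler]:

    [scheduler]
    statsd_on = True
    statsd_host = localhost
    statsd_port = 8125
    statsd_prefix = airflow

netdata then picks the metrics up automatically and you can build dashboards on 
top of them.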

Kaxil and I also have a large DAG (something like 1000 tasks overall, with many 
subdags) that we were using for testing the DAG serialization workflow.

I don't think we need an AIP for this though: none of the changes are 
contentious or likely to require huge changes to the existing code base, so I 
think we can just get started on them, and ask for review from a wider dev 
audience if we think a change warrants it.

Such as this change here: https://github.com/apache/airflow/pull/6792 - 
SQLAlchemy has a way to cache the SQL of queries (not the results, just the 
SQL), and generating that SQL is what I identified as a bottleneck. The syntax 
is a little bit odd, and you have to make sure you "prime" the query for things 
run in a sub-process to get the full benefit (which I will do in a nicer way), 
but even then I think this is a worthwhile change.
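
For anyone who hasn't come across it, the feature is SQLAlchemy's "baked query" 
extension, and the pattern looks roughly like this (a generic illustration of 
the syntax, not the exact code from the PR):

    from sqlalchemy import bindparam
    from sqlalchemy.ext import baked

    from airflow.models import TaskInstance  # any mapped model works here

    bakery = baked.bakery()

    def running_tis_for_dag(session, dag_id):
        # The lambdas are cached keyed on their source location, so the SQL
        # string is only compiled the first time; later calls reuse it and
        # just swap in new bound parameters.
        bq = bakery(lambda session: session.query(TaskInstance))
        bq += lambda q: q.filter(TaskInstance.dag_id == bindparam('dag_id'))
        bq += lambda q: q.filter(TaskInstance.state == 'running')
        return bq(session).params(dag_id=dag_id).all()

The "odd syntax" is those lambdas -- they are what lets SQLAlchemy cache the 
compiled statement instead of rebuilding it on every call.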

-ash

> On 11 Dec 2019, at 10:05, Jarek Potiuk <[email protected]> wrote:
> 
> *TL;DR;* We are gearing up @ Polidea to work on Apache Airflow performance
> and I wanted to start a discussion that might lead to creating a new AIP
> and implementing it :). Here is a high-level summary of the discussions we
> had so far, so this might be a starting point to get some details worked
> out and end up with polished AIP.
> 
> *Motivation*
> 
> Airflow has a number of areas that require performance testing, and
> currently, when releasing a new version of Airflow, we are not able to reason
> about potential performance impacts or discuss performance degradation
> problems, because we have only anecdotal evidence of the performance
> characteristics of Airflow. There is this fantastic post
> <https://www.astronomer.io/blog/profiling-the-airflow-scheduler/> from Ash
> about profiling the scheduler, but this is only one part of Airflow and it
> would be great to turn our performance work into a regular and managed
> approach.
> 
> *Areas*
> 
> We think about two types of performance streams:
> 
>   - Instrumentation of the code
>   - Synthetic CI-runnable E2E tests.
> 
> Those can be implemented independently, but instrumentation might be of
> great help when synthetic tests are run.
> 
> *Instrumentation*
> 
> Instrumentation is targeted towards Airflow users and DAG developers.
> 
> Instrumentation is mostly about gathering more performance characteristics,
> including numbers related to database queries and performance, latency of
> DAG scheduling, parsing time etc. This can all be exported using the current
> statsd interface and visualised using one of the standard metric tools
> (Grafana, Prometheus etc.), and some of it can be surfaced in the existing
> UI where you can see it in the context of actual DAGs (mostly latency
> numbers). Most of it should be back-portable to 1.10.*. It can be used to
> track down performance numbers with some real in-production DAGs.
> 
> Part of the effort should be instructions on how to set up monitoring and
> dashboards for Airflow, documentation of what each metric means, and making
> sure the data is actionable - it should be rather clear to developers
> how to turn their observations into actions.
> 
> An important part of this will also be to provide a way to export such
> performance information easily and share it with the community or service
> providers, so that they can reason in a more educated way about the
> performance problems their users experience.
> 
> *Synthetic CI-runnable E2E tests*
> 
> Synthetic tests are targeted towards Airflow committers/contributors
> 
> The synthetic CI-runnable tests will be able to produce generic
> performance numbers for the core of Airflow. We can prepare synthetic data
> and run Airflow with NullExecutors, empty tasks, various executors and
> different deployments, to focus only on the performance of the core of
> Airflow itself. We can run those performance tests on already released
> Airflow versions to compare the performance numbers, which also has the
> benefit that we can visualise trends, compare versions, and have automated
> performance testing of new releases. Some of the instrumentation changes
> described above (especially the parts that will be easily back-portable to
> the 1.10.* series) might also be helpful in getting some performance
> characteristics.
> 
> *What's my ask?*
> 
> I'd love to start a community discussion on this. Please feel free to add
> your comments and let's start working on it soon. I would love to get some
> interested people and organise a SIG-performance group soon (@Kevin - I
> think you mentioned you would be interested, but anyone else is welcome to
> join as well).
> 
> J.
> 
> 
> -- 
> 
> Jarek Potiuk
> Polidea <https://www.polidea.com/> | Principal Software Engineer
> 
> M: +48 660 796 129
