Good news - I managed to debug and work around the Flask
auto-instrumentation issue, so Melodie should be unblocked.

It was not an easy one to pull off - it required a bit of knowledge of
how the Airflow webserver works under the hood, and finding out that
gunicorn's fork model needs a workaround for the OpenTelemetry
integration.

This makes our OpenTelemetry instrumentation a bit more "complex"
(but only a bit) and slightly more "exporter-specific" - currently we
have a hard-coded Jaeger exporter - but in the future we should be
able to automate it better. We might not even need any workarounds once
https://github.com/open-telemetry/opentelemetry-python-contrib/issues/171
(Add integration for Gunicorn) is implemented in the
opentelemetry-python library (maybe we can even contribute it).
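For context, the workaround is along the lines of the gunicorn
post_fork hook that the OpenTelemetry Python docs describe for fork
process models - the span processor's background export thread does not
survive fork(), so the tracer provider has to be set up per worker. The
snippet below is an illustrative sketch, not the exact code from the PR
(the agent host/port are assumed defaults):

```python
# gunicorn.conf.py - configure OpenTelemetry *after* each worker forks,
# because the BatchSpanProcessor's exporter thread does not survive fork().
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.jaeger.thrift import JaegerExporter

def post_fork(server, worker):
    server.log.info("Worker spawned (pid: %s)", worker.pid)
    trace.set_tracer_provider(TracerProvider())
    trace.get_tracer_provider().add_span_processor(
        BatchSpanProcessor(
            # Hard-coded Jaeger exporter for now, as mentioned above;
            # host and port here are illustrative local-agent defaults.
            JaegerExporter(agent_host_name="localhost", agent_port=6831)
        )
    )
```

This is also why the setup is currently "exporter-specific" - the
exporter instance is created inside the hook rather than picked up from
auto-instrumentation.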

You can see the changes I had to implement here:
https://github.com/apache/airflow/pull/20677/commits/31f5fce3ff456dfffcd5814db9504094a0448ded
and see the comment here for the screenshots from Jaeger:
https://github.com/apache/airflow/pull/20677#issuecomment-1008327884

We are now going to proceed with further integration (hopefully with
less trouble) of the other existing instrumentations.

Howard, Nick,

I think what might be helpful (and Howard's product-manager view could
be super-helpful here) is to define the scope of the integration of the
"Airflow-specific" telemetry: defining the metrics that we would like
to have (starting from the current set of metrics), and later proposing
some ways to test them and produce some basic dashboards with one of
the monitoring tools we could choose - all at a "Proof-of-Concept"
level, so that we can produce a real example and screenshots of how
the OpenTelemetry integration might work and what value it might
bring.

The end goal of Melodie's internship is to prepare an Airflow
Improvement Proposal where, based on our learnings from the
internship, we propose what the integration should look like.

WDYT?

J.

On Sat, Jan 8, 2022 at 12:20 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Yep. Absolutely. We are at the stage now (and this is something we are
> looking at, and I had planned to do this weekend) of seeing why
> auto-instrumentation of OpenTelemetry in Melody's PR
> does not seem to auto-instrument our Flask integration (we chose Flask
> as the first integration that should be "easy", but for whatever reason
> auto-instrumentation - even in the `--debug` mode of airflow - does
> not seem to work despite everything seemingly being "correct").
>
> I plan to take a look at it today and we can discuss it in Melody's
> PR. It would be fantastic if we could work on it together :).
>
> J.
>
> On Sat, Jan 8, 2022 at 12:09 PM melodie ezeani <fluxi...@gmail.com> wrote:
> >
> > Hi Nick,
> >
> > You can look at the PR, or clone my fork and try running it in your
> > local environment, to see if there's any way we can improve the
> > auto-instrumentation.
> > Would love to get feedback.
> > Thank you
> >
> > On Sat, 8 Jan 2022 at 12:19 AM, <nick@shook.family> wrote:
> >>
> >> hi all, been lurking for a while - this is my first post.
> >>
> >> what I like about open telemetry is that you can send all telemetry traces 
> >> to STDOUT (or any logs) which you can then pipe to many log forwarders of 
> >> choice. imo this is the easiest way to set it up and a default that should 
> >> work in the vast majority of airflow use cases.
> >>
> >> the PR looks like a great start! what can I do to help?
> >> ---
> >> nick
> >>
> >> On Jan 7, 2022, at 14:37, Elad Kalif <elad...@apache.org> wrote:
> >>
> >> Hi Howard,
> >>
> >> We actually have an Outreachy intern (Melodie) who is working on researching
> >> how OpenTelemetry can be integrated with Airflow.
> >> Draft PR for demo : https://github.com/apache/airflow/pull/20677
> >> This is an initial effort for a POC.
> >> Maybe you can work together on this?
> >>
> >>
> >> On Sat, Jan 8, 2022 at 12:19 AM Howard Yoo 
> >> <howard....@astronomer.io.invalid> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I’m a staff product manager at Astronomer, and wanted to post this email
> >>> according to the guide at
> >>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals .
> >>>
> >>> Currently, the main method to publish telemetry data out of Airflow is
> >>> through its StatsD implementation:
> >>> https://github.com/apache/airflow/blob/main/airflow/stats.py , and
> >>> currently Airflow supports two flavors of StatsD: the original one, and
> >>> Datadog's DogStatsD implementation.
> >>>
> >>> Through this implementation, we have the following list of metrics that 
> >>> would be available for other popular monitoring tools to collect, 
> >>> monitor, visualize, and alert on metrics generated from airflow: 
> >>> https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
> >>>
> >>> There are a number of limitations in Airflow's current implementation of
> >>> its metrics using StatsD.
> >>> 1. StatsD is based on a simple metrics format that does not support richer
> >>> context. Its metric names carry some of that context (such as
> >>> dag id, task id, etc.), but this is limited by the context having to be
> >>> formatted into the metric name itself. A better approach
> >>> would be to utilize 'tags' attached to the metrics data to add
> >>> more context.
> >>> 2. StatsD also utilizes UDP as its main network protocol, but UDP
> >>> is a simple protocol that does not guarantee reliable transmission of
> >>> the payload. Moreover, many monitoring products are moving to more
> >>> modern protocols such as HTTPS to send out metrics.
> >>> 3. StatsD does support 'counter', 'gauge', and 'timer' metrics, but does not
> >>> support distributed traces and log ingestion.
> >>>
> >>> Due to the above reasons, I have been looking at OpenTelemetry
> >>> (https://github.com/open-telemetry) as a potential replacement for
> >>> Airflow's current telemetry instrumentation. OpenTelemetry is the product
> >>> of the merger of OpenTracing and OpenCensus, and is quickly gaining momentum
> >>> as the 'standard' means of producing and delivering telemetry data -
> >>> not only metrics, but distributed traces as well as logs. The technology
> >>> is also geared towards better monitoring of cloud-native software. Many
> >>> monitoring tool vendors support OpenTelemetry (Tanzu, Datadog,
> >>> Honeycomb, Lightstep, etc.), and OpenTelemetry's modular architecture is
> >>> designed to be compatible with existing legacy instrumentations. There
> >>> are also stable Python SDKs and APIs to easily implement it in
> >>> Airflow.
> >>>
> >>> Therefore, I'd like to work on a proposal to improve the metrics and
> >>> telemetry capability of Airflow by adding configuration and support for
> >>> OpenTelemetry, so that, while maintaining backward compatibility with the
> >>> existing StatsD-based metrics, we would also have the opportunity to base
> >>> distributed traces and logs on it, making it easier
> >>> for any OpenTelemetry-compatible tool to monitor Airflow with
> >>> richer information.
> >>>
> >>> If you have been thinking of a need to improve the current metrics
> >>> capabilities of Airflow, and have been considering standards like
> >>> OpenTelemetry, please feel free to join the thread and provide any
> >>> opinions or feedback. I also generally think that we may need to review
> >>> our current list of metrics and assess whether they are really useful in
> >>> terms of monitoring and observability of Airflow. There are things that
> >>> we might want to add, such as more executor-related metrics,
> >>> scheduler-related metrics, as well as operator and even DB- and XCom-
> >>> related metrics, to better assess the health of Airflow and make this
> >>> information helpful for faster troubleshooting and problem resolution.
> >>>
> >>> Thanks and regards,
> >>> Howard
> >>
> >>