Nick,

Thanks for the comment - yeah, I also tried a ConsoleExporter. However,
I think Jaeger was a good choice here - Melodie's choice :). It will be
easier to "reason" about some metrics when we see them in the Web UI,
and since Jaeger has a standalone containerized instance (and we
already integrate with it), it can give us some extra insights and
basic graphs that make it much easier to see whether things work as we
expect (for example, it nicely plots the execution times of the Flask
methods on a graph).

Part of the POC work was to do some demos showing what information we
can get, and I have a feeling that doing them with a Web UI
(especially one we have already integrated with) will be much more
powerful and demoable.

Also, importantly - we are not starting from "scratch". We have the
Breeze development environment, which is based on docker-compose, and
long term we will also be able to integrate it "nicely" via an
`--integration jaeger` flag - similarly to what we do with some other
integrations (kerberos, mongo, etc.). We have not done it yet while we
establish a "simple" solution, but once we do, we will try to do it in
a way that makes open-telemetry a "pluggable" component of Airflow -
both in production and for development purposes. This also gives us a
chance to see how we can make it easy for Airflow developers to add
open-telemetry integrations and (important!) test them easily.

The end goal for the POC is to be able to run `./breeze start-airflow
--integration jaeger` to start Airflow and Jaeger together and have
the OT integration enabled (and disabled when Jaeger is not started as
an integration). We have a very similar approach for kerberos: when we
run `./breeze start-airflow --integration kerberos`, Airflow starts
with the kerberos integration enabled, and kerberos starts as a
separate image via docker-compose. So this fits very well into the
overall "development" environment of Airflow.

Howard,

I am not sure if you know, but we already have a project outlined
here, so we know at a high level what we want to achieve over the next
2 months. A few tasks are already hashed out in detail; some of them
are drafted as just notes for now:
https://github.com/apache/airflow/projects/14
I think one of the things you can help us with is scoping out and
adding details to the cards/issues. I thought the most relevant are
these:

* https://github.com/apache/airflow/projects/14#card-74068217 ("Expand
the POC with Adding Airflow Metrics")
* https://github.com/apache/airflow/projects/14#card-74068317 ("POC of
Monitoring dashboard visualizing the metrics")
* https://github.com/apache/airflow/projects/14#card-75801333 ("Expand
POC with Open Telemetry for logging integration")

But I am happy to discuss it offline here if you have other ideas :)

This week I am traveling quite a lot (mostly private errands) and it
will be difficult for me to arrange something (and Elad is also on
vacation), but I think the following week we could think about making
a more public demo of the integration (Melodie - what do you think
:)?). This will give Melodie a chance to try other "standard"
instrumentations, and we will likely be able to see more of what
"open-telemetry" can give us out-of-the-box. During the demo/meeting
we could also discuss the scope and ideas for the follow-up parts.

J.

On Mon, Jan 10, 2022 at 1:20 AM <nick@shook.family> wrote:
>
> This sounds great. I left a small comment about the console-span processor.
> While I think Jaeger is a great choice for production dashboards, just
> printing the spans to STDOUT from the airflow server would be a great POC
> imo, b/c it starts the discussion toward structured logging. By having a
> discussion on structured logging first, everything down the line (dashboards,
> metrics, SLOs, etc.) will be much easier.
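To make the structured-output idea concrete, here is a stdlib-only sketch of what one span per line on STDOUT could look like (the real integration would use opentelemetry-sdk's `ConsoleSpanExporter`; the field names and the `emit_span` helper below are just illustrative):

```python
import json
import sys
import time


def emit_span(name, start_ns, end_ns, attributes):
    """Print one span as a single JSON line - easy to pipe to log forwarders."""
    span = {
        "name": name,
        "start_time_unix_nano": start_ns,
        "end_time_unix_nano": end_ns,
        "duration_ms": (end_ns - start_ns) / 1e6,
        "attributes": attributes,
    }
    json.dump(span, sys.stdout)
    sys.stdout.write("\n")


start = time.time_ns()
# ... a (hypothetical) Flask request handler would run here ...
end = start + 5_000_000  # pretend the handler took 5 ms
emit_span("GET /health", start, end,
          {"http.method": "GET", "http.route": "/health"})
```

Because every span is a self-describing JSON line, a log ingestion service can parse it without extra configuration - which is exactly the structured-logging discussion starter mentioned above.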
>
> fwiw, I think I’d like to see log ingestion services parse structured data 
> and create the dashboards on my behalf from just this. I don’t think as an 
> end-user of tools like Datadog I’d be asking for too much.
>
> if there’s a working session on this would love to join, thanks all for this 
> awesome project!
> ---
> nick
>
> On Jan 9, 2022, at 08:35, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Good news - I managed to debug and fix/work around the flask auto
> instrumentation, and Melodie should be unblocked.
>
> It was not an easy one to pull off - it required a bit of knowledge of
> how the airflow webserver works under the hood, and finding out that
> gunicorn's fork model needs a workaround for the open-telemetry
> integration.
>
> This makes our open-telemetry instrumentation a bit more "complex"
> (but only a bit) and slightly more "exporter-specific" - currently we
> have a hard-coded Jaeger Exporter - but in the future we should be
> able to automate it better. We might not even need any workarounds
> once this issue:
> https://github.com/open-telemetry/opentelemetry-python-contrib/issues/171
> (Add integration for Gunicorn) is implemented in the
> opentelemetry-python library (maybe we can even contribute it).
>
> You can see the changes I had to implement here:
> https://github.com/apache/airflow/pull/20677/commits/31f5fce3ff456dfffcd5814db9504094a0448ded
> and see the comment here for the screenshots from Jaeger:
> https://github.com/apache/airflow/pull/20677#issuecomment-1008327884
>
> We are now going to proceed with further integration (hopefully with
> less trouble) of the other existing instrumentations.
>
> Howard, Nick,
>
> I think what might be helpful (and Howard's product-manager view
> might be super-helpful here) is to define the scope of the
> "Airflow-specific" telemetry integration: defining the metrics we
> would like to have (starting from the current set of metrics) and
> later proposing some ways to test it and produce some basic
> dashboards with some of the monitoring tools we could choose - all at
> a "Proof-of-Concept" level, so that we can produce some real examples
> and screenshots of how the open-telemetry integration might work and
> what value it might bring.
>
> The end goal of Melodie's internship is to prepare an Airflow
> Improvement Proposal where - based on our learnings from the
> internship - we propose how the integration would look.
>
> WDYT ?
>
> J.
>
> On Sat, Jan 8, 2022 at 12:20 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
>
> Yep, absolutely. We are at the stage now (and this is something we
> are looking at, and I have planned for this weekend) of figuring out
> why auto-instrumentation of open-telemetry in Melodie's PR does not
> seem to auto-instrument our Flask integration. We chose Flask as the
> first integration that should be "easy", but for whatever reason the
> auto-instrumentation - even in the `--debug` mode of airflow - does
> not seem to work, despite everything seemingly being "correct".
>
> I plan to take a look at it today, and we can discuss it in
> Melodie's PR. It would be fantastic if we could work on it together :).
>
> J.
>
> On Sat, Jan 8, 2022 at 12:09 PM melodie ezeani <fluxi...@gmail.com> wrote:
>
>
> Hi Nick,
>
> You can look at the PR or clone my fork and try running it in your
> local environment to see if there's any way we can improve the
> auto-instrumentation.
> Would love to get feedback.
> Thank you
>
> On Sat, 8 Jan 2022 at 12:19 AM, <nick@shook.family> wrote:
>
>
> hi all, been lurking for a while - this is my first post.
>
> what I like about open telemetry is that you can send all telemetry traces to
> STDOUT (or any log stream), which you can then pipe to many log forwarders of
> choice. imo this is the easiest way to set it up, and a default that should
> work in the vast majority of airflow use cases.
>
> the PR looks like a great start! what can I do to help?
> ---
> nick
>
> On Jan 7, 2022, at 14:37, Elad Kalif <elad...@apache.org> wrote:
>
> Hi Howard,
>
> We actually have outreachy intern (Melodie) that is working on researching 
> how open-telemetry can be integrated with Airflow.
> Draft PR for demo : https://github.com/apache/airflow/pull/20677
> This is an initial effort for a POC.
> Maybe you can work together on this?
>
>
> On Sat, Jan 8, 2022 at 12:19 AM Howard Yoo <howard....@astronomer.io.invalid> 
> wrote:
>
>
> Hi all,
>
> I’m a staff product manager at Astronomer, and wanted to post this email
> according to the guide from
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals .
>
> Currently, the main method to publish telemetry data out of airflow is
> through its StatsD implementation:
> https://github.com/apache/airflow/blob/main/airflow/stats.py . Airflow
> currently supports two flavors of StatsD: the original one, and
> Datadog’s dogstatsd implementation.
>
> Through this implementation, we have the following list of metrics
> available for other popular monitoring tools to collect, monitor,
> visualize, and alert on:
> https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
>
> There are a number of limitations in airflow’s current implementation of its
> metrics using StatsD:
> 1. StatsD is based on a simple metrics format that does not support richer
> context. The metric name can encode some of that context (such as dag id,
> task id, etc.), but this is limited by the formatting constraint of having
> to be part of the metric name itself. A better approach would be to attach
> ‘tags’ to the metrics data to add more context.
> 2. StatsD also uses UDP as its main network protocol, but UDP is simple and
> does not guarantee reliable transmission of the payload. Moreover, many
> monitoring protocols are moving to more modern transports such as HTTPS to
> send out metrics.
> 3. StatsD does support ‘counter’, ‘gauge’, and ‘timer’, but does not support
> distributed traces or log ingestion.
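Point 1 can be made concrete with a small sketch (the metric name and helper functions below are hypothetical, loosely modeled on the two wire formats): classic StatsD has to flatten context into the metric name, while a dogstatsd-style format keeps the name stable and attaches context as tags:

```python
def statsd_counter(metric, dag_id, task_id, value=1):
    """Classic StatsD wire format: context flattened into the metric name."""
    return f"{metric}.{dag_id}.{task_id}:{value}|c"


def dogstatsd_counter(metric, value=1, **tags):
    """dogstatsd-style wire format: stable name, context carried as tags."""
    tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return f"{metric}:{value}|c|#{tag_str}"


# One metric *name* per dag/task combination - hard for backends to group:
print(statsd_counter("airflow.task_finished", "my_dag", "my_task"))
# airflow.task_finished.my_dag.my_task:1|c

# One stable name, queryable/filterable by tag:
print(dogstatsd_counter("airflow.task_finished",
                        dag_id="my_dag", task_id="my_task"))
# airflow.task_finished:1|c|#dag_id:my_dag,task_id:my_task
```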
>
> Due to the above reasons, I have been looking at OpenTelemetry
> (https://github.com/open-telemetry) as a potential replacement for airflow’s
> current telemetry instrumentation. OpenTelemetry is the merger of OpenTracing
> and OpenCensus, and is quickly gaining momentum as a ‘standard’ means of
> producing and delivering telemetry data - not only metrics, but distributed
> traces as well as logs. The technology is also geared towards better
> monitoring of cloud-native software. Many monitoring tool vendors support
> OpenTelemetry (Tanzu, Datadog, Honeycomb, Lightstep, etc.), and
> OpenTelemetry’s modular architecture is designed to be compatible with
> existing legacy instrumentations. There are also stable Python SDKs and APIs
> that make it easy to implement in airflow.
>
> Therefore, I’d like to work on a proposal to improve the metrics and
> telemetry capability of airflow by adding configuration and support for
> OpenTelemetry. While maintaining backward compatibility with the existing
> StatsD-based metrics, we would also have the opportunity to base distributed
> traces and logs on it, so that any OpenTelemetry-compatible tool could
> monitor airflow with richer information.
>
> If you have been thinking about the need to improve the current metrics
> capabilities of airflow, and have been considering standards like
> OpenTelemetry, please feel free to join the thread and provide any opinions
> or feedback. I also generally think that we may need to review our current
> list of metrics and assess whether they are really useful for monitoring and
> observability of airflow. There are things we might want to add, such as more
> executor-related metrics, scheduler-related metrics, as well as operator and
> even DB- and XCom-related metrics, to better assess the health of airflow and
> make this information helpful for faster troubleshooting and problem
> resolution.
>
> Thanks and regards,
> Howard
>
>
>
>
