Hi Jarek,

Yes, I also think making a more public demo of the integration is a good idea.
On Mon, 10 Jan 2022 at 6:32 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
> Nick,
>
> Thanks for the comment - yeah. I also tried a ConsoleExporter, but I think
> Jaeger was a good choice in this case - Melodie's :). I think it will be
> easier to "reason" about some metrics when we see them in the Web UI, and
> since Jaeger has a standalone containerized instance (and we already
> integrate with it), it might give us some extra insights and basic graphs
> that make it much easier to see whether things are working as we expect
> (for example, it nicely plots the execution times of the flask methods on
> a graph, which already tells us some basic info).
>
> Part of the POC work was to do some demos showing what information we can
> get, and I have a feeling that doing them with a Web UI (especially since
> we have already integrated with it) will be much more powerful and
> demoable.
>
> Also, importantly, we do not start from scratch. We have the breeze
> development environment, which is based on docker-compose, and long term
> we will also be able to integrate it "nicely" with an
> `--integration jaeger` flag - similarly to what we do with some other
> integrations (kerberos, mongo, etc.). We have not done that yet while we
> establish a "simple" solution, but once we do, we will try to do it in a
> way that makes open-telemetry a "pluggable" component of Airflow - both
> in production and for development purposes. This also gives us a chance
> to see how we can make it easy for Airflow developers to add
> open-telemetry integrations and (important!) test them easily.
>
> The end goal for the POC is to be able to run
> `./breeze start-airflow --integration jaeger` to start airflow and jaeger
> with the OT integration enabled (and disabled when jaeger is not started
> as an integration). We have a very similar approach for kerberos.
> When we run `./breeze start-airflow --integration kerberos`, airflow
> starts with the kerberos integration enabled and kerberos starts as a
> separate image via the docker-compose integration. So this fits very well
> into the overall "development" environment of Airflow.
>
> Howard,
>
> I am not sure if you know, but we already have a project outlined here,
> so we know at a high level what we want to achieve over the next 2
> months: we have a few tasks already hashed out in detail; some of them
> are drafted as just notes for now:
> https://github.com/apache/airflow/projects/14
> I think one of the things you can help us with is to scope out and add
> some details to the cards/issues. I thought the most relevant are these:
>
> * https://github.com/apache/airflow/projects/14#card-74068217 ("Expand
> the POC with Adding Airflow Metrics")
> * https://github.com/apache/airflow/projects/14#card-74068317 ("POC of
> Monitoring dashboard visualizing the metrics")
> * https://github.com/apache/airflow/projects/14#card-75801333 ("Expand
> POC with Open Telemetry for logging integration")
>
> But I am happy to discuss it further here if you have other ideas :)
>
> This week I am traveling quite a lot (mostly private errands) and it
> will be difficult for me to arrange something (and Elad is also on
> vacation), but I think the following week we could think about making a
> more public demo of the integration (Melodie - what do you think :)?).
> This will give Melodie a chance to try other "standard"
> instrumentations, we will likely see more of what "open-telemetry" can
> give us out of the box, and during the demo/meeting we could also
> discuss the scope and ideas for the follow-up parts.
>
> J.
>
> On Mon, Jan 10, 2022 at 1:20 AM <nick@shook.family> wrote:
> >
> > This sounds great. I left a small comment about the console-span
> > processor.
> > While I think Jaeger is a great choice for production dashboards, just
> > printing the spans to STDOUT from the airflow server would be a great
> > poc imo b/c it starts the discussion toward structured logging. By
> > having a discussion on structured logging first, everything down the
> > line (dashboards, metrics, slo's etc.) will be much easier.
> >
> > fwiw, I think I’d like to see log ingestion services parse structured
> > data and create the dashboards on my behalf from just this. I don’t
> > think as an end-user of tools like Datadog I’d be asking for too much.
> >
> > if there’s a working session on this I would love to join, thanks all
> > for this awesome project!
> > ---
> > nick
> >
> > On Jan 9, 2022, at 08:35, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > Good news - I managed to debug and fix/work around the flask
> > auto-instrumentation, and Melodie should be unblocked.
> >
> > It was not an easy one to pull off - it required a bit of knowledge of
> > how the airflow webserver works under the hood, and finding out that
> > gunicorn's fork model needs a workaround for the open-telemetry
> > integration.
> >
> > This makes our instrumentation a bit more "complex" (but only a bit)
> > and slightly more "exporter-specific" - currently we have a hard-coded
> > Jaeger Exporter - but we should be able to get it better automated.
> > We might not even need any workarounds once
> > https://github.com/open-telemetry/opentelemetry-python-contrib/issues/171
> > (Add integration for Gunicorn) is implemented in the
> > open-telemetry-python library (maybe we can even contribute it).
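For readers following along: the kind of gunicorn workaround Jarek describes usually amounts to initializing the tracer provider inside gunicorn's `post_fork` server hook, so that each worker process creates its own exporter (and its background export thread) after the fork rather than inheriting a dead one from the master. A minimal sketch under those assumptions - the service name and Jaeger endpoint are illustrative, and this is not Airflow's actual code:

```python
# gunicorn_config.py -- illustrative sketch of a post-fork OpenTelemetry setup.
# Batch exporters run a background thread, which does not survive gunicorn's
# fork; creating the provider per worker in post_fork avoids that problem.

def post_fork(server, worker):
    from opentelemetry import trace
    from opentelemetry.sdk.resources import Resource
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.jaeger.thrift import JaegerExporter

    # Runs once in each freshly forked worker process.
    resource = Resource.create({"service.name": "airflow-webserver"})
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(
        BatchSpanProcessor(
            JaegerExporter(agent_host_name="localhost", agent_port=6831)
        )
    )
    trace.set_tracer_provider(provider)
```

A worker-level hook like this (rather than module-level setup) is also what the linked opentelemetry-python-contrib issue aims to automate.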
> >
> > You can see the changes I had to implement here:
> > https://github.com/apache/airflow/pull/20677/commits/31f5fce3ff456dfffcd5814db9504094a0448ded
> > and see the comment here for the screenshots from Jaeger:
> > https://github.com/apache/airflow/pull/20677#issuecomment-1008327884
> >
> > We are now going to proceed with further integration (hopefully with
> > less trouble) of other existing instrumentations.
> >
> > Howard, Nick,
> >
> > I think what might be helpful (and Howard's product-manager view might
> > be super-helpful here) is to define the scope of the integration of
> > the "Airflow-specific" telemetry: defining the metrics we would like
> > to have (starting from the current set of metrics) and later proposing
> > some ways to test it and produce some basic dashboards with some of
> > the monitoring tools we could choose - all at a "Proof-of-Concept"
> > level, so that we can produce some real examples and screenshots of
> > how the open-telemetry integration might work and what value it might
> > bring.
> >
> > The end goal of Melodie's internship is to prepare an Airflow
> > Improvement Proposal where - based on our learnings from the
> > internship - we propose how the integration would look.
> >
> > WDYT?
> >
> > J.
> >
> > On Sat, Jan 8, 2022 at 12:20 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > Yep. Absolutely. We are at the stage now (and this is something I have
> > planned for this weekend) of looking at why auto-instrumentation of
> > open-telemetry in Melodie's PR does not seem to auto-instrument our
> > Flask integration (we chose flask as the first integration that should
> > be "easy", but for whatever reason auto-instrumentation - even in the
> > `--debug` mode of airflow - does not seem to work despite everything
> > seemingly being "correct").
> >
> > I plan to take a look at it today and we can discuss it in Melodie's
> > PR.
> > That would be fantastic if we could work on it together :).
> >
> > J.
> >
> > On Sat, Jan 8, 2022 at 12:09 PM melodie ezeani <fluxi...@gmail.com> wrote:
> >
> > Hi nick,
> >
> > You can look at the PR, or clone my fork and try running it in your
> > local environment, and see if there’s any way we can improve on the
> > auto-instrumentation. Would love to get feedback.
> > Thank you
> >
> > On Sat, 8 Jan 2022 at 12:19 AM, <nick@shook.family> wrote:
> >
> > hi all, been lurking for a while - this is my first post.
> >
> > what I like about open telemetry is that you can send all telemetry
> > traces to STDOUT (or any logs), which you can then pipe to the log
> > forwarder of your choice. imo this is the easiest way to set it up and
> > a default that should work in the vast majority of airflow use cases.
> >
> > the PR looks like a great start! what can I do to help?
> > ---
> > nick
> >
> > On Jan 7, 2022, at 14:37, Elad Kalif <elad...@apache.org> wrote:
> >
> > Hi Howard,
> >
> > We actually have an Outreachy intern (Melodie) who is researching how
> > open-telemetry can be integrated with Airflow.
> > Draft PR for demo: https://github.com/apache/airflow/pull/20677
> > This is an initial effort for a POC.
> > Maybe you can work together on this?
> >
> > On Sat, Jan 8, 2022 at 12:19 AM Howard Yoo
> > <howard....@astronomer.io.invalid> wrote:
> >
> > Hi all,
> >
> > I’m a staff product manager at Astronomer, and wanted to post this
> > email according to the guide from
> > https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals
> >
> > Currently, the main method to publish telemetry data out of airflow is
> > through its statsd implementation:
> > https://github.com/apache/airflow/blob/main/airflow/stats.py
> > Airflow currently supports two flavors of statsd: the original one,
> > and Datadog’s dogstatsd implementation.
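Nick's STDOUT suggestion maps to OpenTelemetry's built-in `ConsoleSpanExporter`, which writes each finished span to stdout, where any log forwarder can pick it up - no Jaeger or collector required. A minimal bootstrap sketch, assuming `opentelemetry-sdk` is installed (the span name below is just illustrative):

```python
# Bootstrap sketch: route all spans to STDOUT instead of a backend.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
# SimpleSpanProcessor exports each span synchronously as soon as it ends.
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("demo-span"):
    pass  # on exit, the finished span is printed to stdout
```

Swapping `ConsoleSpanExporter` for a Jaeger or OTLP exporter later is a one-line change, which is what makes this a cheap starting point for the structured-logging discussion.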
> >
> > Through this implementation, we have the following list of metrics
> > that is available for other popular monitoring tools to collect,
> > monitor, visualize, and alert on:
> > https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
> >
> > There are a number of limitations in airflow’s current implementation
> > of its metrics using statsd.
> > 1. StatsD is based on a simple metrics format that does not support
> > richer context. The metric name can encode some of that context (such
> > as dag id, task id, etc.), but this is limited because the context has
> > to be part of the metric name itself. A better approach would be to
> > utilize ‘tags’ attached to the metrics data to add more context.
> > 2. StatsD also uses UDP as its main network protocol, and UDP does not
> > guarantee reliable transmission of the payload. Moreover, many
> > monitoring tools are moving toward more modern protocols such as HTTPS
> > to send out metrics.
> > 3. StatsD supports ‘counter,’ ‘gauge,’ and ‘timer,’ but does not
> > support distributed traces or log ingestion.
> >
> > Due to the above reasons, I have been looking at opentelemetry
> > (https://github.com/open-telemetry) as a potential replacement for
> > airflow’s current telemetry instrumentation. Opentelemetry is the
> > merger of OpenTracing and OpenCensus, and is quickly gaining momentum
> > as a ‘standard’ means of producing and delivering telemetry data - not
> > only metrics, but distributed traces and logs as well. The technology
> > is also geared toward better monitoring of cloud-native software. Many
> > monitoring-tool vendors support opentelemetry (Tanzu, Datadog,
> > Honeycomb, Lightstep, etc.), and opentelemetry’s modular architecture
> > is designed to be compatible with existing legacy instrumentations.
> > There are also stable Python SDKs and APIs to easily implement it in
> > airflow.
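Howard's first limitation can be made concrete by comparing the wire formats. Plain statsd must bake context into the metric name, producing one metric series per dag/task combination, while dogstatsd (like OpenTelemetry attributes) carries context as tags on a stable metric name. A stdlib-only sketch - the metric names and tag keys here are illustrative, not Airflow's actual metric names:

```python
def statsd_line(name: str, value: int) -> str:
    """Plain statsd counter datagram: context must be part of the name."""
    return f"{name}:{value}|c"

def dogstatsd_line(name: str, value: int, tags: dict) -> str:
    """Dogstatsd counter datagram: same counter, context travels as tags."""
    tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
    return f"{name}:{value}|c|#{tag_str}"

# Plain statsd: dag id and task id are fused into the metric name.
plain = statsd_line("airflow.ti_successes.my_dag.my_task", 1)

# Tagged: the metric name stays stable and the context is queryable.
tagged = dogstatsd_line(
    "airflow.ti_successes", 1, {"dag_id": "my_dag", "task_id": "my_task"}
)

print(plain)   # airflow.ti_successes.my_dag.my_task:1|c
print(tagged)  # airflow.ti_successes:1|c|#dag_id:my_dag,task_id:my_task
```

The tagged form is what lets a backend aggregate "all task successes" and then slice by `dag_id`, instead of reverse-engineering positions inside a dotted name.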
> >
> > Therefore, I’d like to propose improving the metrics and telemetry
> > capabilities of airflow by adding configuration and support for
> > opentelemetry, so that while maintaining backward compatibility with
> > the existing statsd-based metrics, we would also have the opportunity
> > to base distributed traces and logs on it, making it easier for any
> > opentelemetry-compatible tool to monitor airflow with richer
> > information.
> >
> > If you have been thinking about the need to improve the current
> > metrics capabilities of airflow, and have been considering standards
> > like opentelemetry, please feel free to join the thread and provide
> > any opinions or feedback. I also generally think that we may need to
> > review our current list of metrics and assess whether they are really
> > useful for monitoring and observability of airflow. There are things
> > we might want to add, such as more executor-related and
> > scheduler-related metrics, as well as operator and even DB- and
> > XCOM-related metrics, to better assess the health of airflow and make
> > this information helpful for faster troubleshooting and problem
> > resolution.
> >
> > Thanks and regards,
> > Howard