Hey Jarek,

Just checked this thread today, and was happy to see that there was some work 
already being done.
I very much support the idea of Melodie preparing an Airflow Improvement 
Proposal.

I would also like to see if I can participate and help you folks with this work in 
any way I can.
I was also drafting an AIP for OpenTelemetry, but I think working 
together would be much better.

Also, I do have some feedback: the current list of metrics, and what they track, 
is not really that useful. There is only so much one can do with metrics like 
operator failures and ti failures, since they don’t carry any context-specific 
information. So while we work on making OpenTelemetry available for Airflow, we 
might also review these metrics, verify whether they are really helpful, and see 
whether there are additional metrics that we can instrument while doing this.
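
To make the point about missing context concrete, here is a rough sketch of the 
same counter on the wire, first in the plain StatsD line format and then with 
DogStatsD-style tags. The helper name `statsd_line` and the tag keys `dag_id` 
and `task_id` are my own assumptions for illustration, not what airflow emits 
today.

```python
def statsd_line(name, value, metric_type, tags=None):
    """Format one metric in the StatsD line protocol.

    Plain StatsD is `name:value|type`. The DogStatsD flavor appends
    `|#key:value,...` to carry tags.
    """
    line = f"{name}:{value}|{metric_type}"
    if tags:
        # Sort the tags so the output is deterministic.
        tag_str = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        line += f"|#{tag_str}"
    return line


# Plain counter: tells us a task instance failed, but not which one.
plain = statsd_line("ti_failures", 1, "c")

# Same counter with context attached as tags (hypothetical tag keys).
tagged = statsd_line(
    "ti_failures", 1, "c",
    tags={"dag_id": "example_dag", "task_id": "example_task"},
)

print(plain)
print(tagged)
```

With the tagged form, a backend can slice one metric by DAG or task, rather 
than needing a separate metric name for every combination.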

I think when we are designing distributed traces for Airflow, we should 
also define what kinds of traces would be useful and come up with a better 
naming convention, so that things are clear and easy to understand.

- Howard

On 2022/01/09 16:35:48 Jarek Potiuk wrote:
> Good news - I managed to debug and fix/workaround the flask auto
> instrumentation to work and Melodie should be unblocked.
> 
> It was not an easy one to pull - it required a bit of knowledge of how
> airflow webserver works under the hood and finding out that the
> gunicorn's fork model needs a workaround for open-telemetry
> integration.
> 
> This makes our open-telemetry instrumentation a bit more "complex"
> (but only a bit) and slightly more "exporter-specific" - currently we
> have a hard-coded Jaeger Exporter - but in the future we should be able
> to automate it better, and we might not even
> need any workarounds once this one:
> https://github.com/open-telemetry/opentelemetry-python-contrib/issues/171
> (Add integration for Gunicorn) is implemented in the
> open-telemetry-python library (maybe we can even contribute it).
> 
> You can see the changes I had to implement here:
> https://github.com/apache/airflow/pull/20677/commits/31f5fce3ff456dfffcd5814db9504094a0448ded
> and see the comment here for the screenshots from Jaeger:
> https://github.com/apache/airflow/pull/20677#issuecomment-1008327884
> 
> We are going to proceed with further integration (hopefully with less
> trouble) of other existing instrumentations now.
> 
> Howard, Nick,
> 
> I think what might be helpful (and Howard's product manager view might
> be super-helpful) is to define the scope of the integration of the
> "Airflow-specific" telemetry: defining the metrics that we would like to
> have (starting from the current set of metrics) and later proposing some
> ways to test it and produce some basic dashboards with some of the
> monitoring tools that we could choose. All at a "Proof-Of-Concept"
> level, so that we can produce some real examples and screenshots of how
> the open-telemetry integration might work and what value it might
> bring.
> 
> The end goal of Melodie's internship is to prepare an Airflow
> Improvement Proposal where we could, based on our learnings from the
> internship, propose how the integration would look.
> 
> WDYT ?
> 
> J.
> 
> On Sat, Jan 8, 2022 at 12:20 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > Yep. Absolutely. We are at the stage now (and this is something I have
> > planned to look at this weekend) of figuring out why
> > auto-instrumentation of open-telemetry in Melodie's PR
> > does not seem to auto-instrument our Flask integration (we chose Flask
> > as the first integration that should be "easy", but for whatever reason
> > auto-instrumentation - even in the `--debug` mode of airflow - does
> > not seem to work despite everything seemingly being "correct").
> >
> > I plan to take a look at it today and we can discuss it in Melodie's
> > PR. It would be fantastic if we could work on it together :).
> >
> > J.
> >
> > On Sat, Jan 8, 2022 at 12:09 PM melodie ezeani <fl...@gmail.com> wrote:
> > >
> > > Hi Nick,
> > >
> > > You can look at the PR, or clone my fork and try running it in your local 
> > > environment, and see if there’s any way we can improve on the 
> > > auto-instrumentation.
> > > Would love to get feedback.
> > > Thank you
> > >
> > > On Sat, 8 Jan 2022 at 12:19 AM, <ni...@shook.family> wrote:
> > >>
> > >> hi all, been lurking for a while - this is my first post.
> > >>
> > >> what I like about open telemetry is that you can send all telemetry 
> > >> traces to STDOUT (or any logs) which you can then pipe to many log 
> > >> forwarders of choice. imo this is the easiest way to set it up and a 
> > >> default that should work in the vast majority of airflow use cases.
> > >>
> > >> the PR looks like a great start! what can I do to help?
> > >> ---
> > >> nick
> > >>
> > >> On Jan 7, 2022, at 14:37, Elad Kalif <el...@apache.org> wrote:
> > >>
> > >> Hi Howard,
> > >>
> > > >> We actually have an Outreachy intern (Melodie) who is working on 
> > > >> researching how open-telemetry can be integrated with Airflow.
> > >> Draft PR for demo : https://github.com/apache/airflow/pull/20677
> > >> This is an initial effort for a POC.
> > >> Maybe you can work together on this?
> > >>
> > >>
> > >> On Sat, Jan 8, 2022 at 12:19 AM Howard Yoo <ho...@astronomer.io.invalid> 
> > >> wrote:
> > >>>
> > >>> Hi all,
> > >>>
> > > >>> I’m a staff product manager at Astronomer, and wanted to post this 
> > >>> email according to the guide from 
> > >>> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals
> > >>>  .
> > >>>
> > > >>> Currently, the main method to publish telemetry data out of airflow is 
> > > >>> through its StatsD implementation: 
> > > >>> https://github.com/apache/airflow/blob/main/airflow/stats.py , and 
> > > >>> currently airflow supports two flavors of StatsD: the original one, and 
> > > >>> Datadog’s DogStatsD implementation.
> > >>>
> > >>> Through this implementation, we have the following list of metrics that 
> > >>> would be available for other popular monitoring tools to collect, 
> > >>> monitor, visualize, and alert on metrics generated from airflow: 
> > >>> https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
> > >>>
> > > >>> There are a number of limitations in airflow’s current implementation 
> > > >>> of its metrics using StatsD.
> > > >>> 1. StatsD is based on a simple metrics format that does not support 
> > > >>> richer context. Its metric names contain some of that context 
> > > >>> (such as dag id, task id, etc.), but this is limited because the 
> > > >>> context has to be part of the metric name itself. A better 
> > > >>> approach would be to attach ‘tags’ to the metrics 
> > > >>> data to add more context.
> > > >>> 2. StatsD also uses UDP as its main network protocol, but UDP 
> > > >>> is simple and does not guarantee reliable transmission of 
> > > >>> the payload. Moreover, many monitoring tools are moving to more 
> > > >>> modern protocols such as HTTPS to send out metrics.
> > > >>> 3. StatsD does support ‘counter,’ ‘gauge,’ and ‘timer,’ but does not 
> > > >>> support distributed traces and log ingestion.
> > >>>
> > > >>> For the above reasons, I have been looking at OpenTelemetry 
> > > >>> (https://github.com/open-telemetry) as a potential replacement for 
> > > >>> airflow’s current telemetry instrumentation. OpenTelemetry is the product 
> > > >>> of merging OpenTracing and OpenCensus, and is quickly gaining momentum as 
> > > >>> a ‘standard’ means of producing and delivering 
> > > >>> telemetry data: not only metrics, but distributed traces as well as 
> > > >>> logs. The technology is also geared towards better monitoring of 
> > > >>> cloud-native software. Many monitoring tool vendors support 
> > > >>> OpenTelemetry (Tanzu, Datadog, Honeycomb, Lightstep, etc.) and 
> > > >>> OpenTelemetry’s modular architecture is designed to be compatible with 
> > > >>> existing legacy instrumentations. There are also stable Python SDKs 
> > > >>> and APIs to easily implement it in airflow.
> > >>>
> > > >>> Therefore, I’d like to work on a proposal to improve the metrics and 
> > > >>> telemetry capability of airflow by adding configuration and support for 
> > > >>> OpenTelemetry, so that while maintaining backward compatibility with the 
> > > >>> existing StatsD-based metrics, we would also have an opportunity to 
> > > >>> base distributed traces and logs on it, making it 
> > > >>> easier for any OpenTelemetry-compatible tool to monitor 
> > > >>> airflow with richer information.
> > >>>
> > > >>> If you have been thinking about the need to improve the current metrics 
> > > >>> capabilities of airflow, or about standards like 
> > > >>> OpenTelemetry, please feel free to join the thread and provide any 
> > > >>> opinions or feedback. I also generally think that we may need to review 
> > > >>> our current list of metrics and assess whether they are really useful 
> > > >>> in terms of monitoring and observability of airflow. There are things 
> > > >>> that we might want to add, such as more executor-related 
> > > >>> metrics, scheduler-related metrics, as well as operator and even DB 
> > > >>> and XCom-related metrics, to better assess the health of airflow and 
> > > >>> make this information helpful for faster troubleshooting and problem 
> > > >>> resolution.
> > >>>
> > >>> Thanks and regards,
> > >>> Howard
> > >>
> > >>
> 
