Hi Jarek,

Yes, I also think making a more public demo of the integration is a good
idea.

On Mon, 10 Jan 2022 at 6:32 AM, Jarek Potiuk <ja...@potiuk.com> wrote:

> Nick,
>
> Thanks for the comment - yeah. I also tried a ConsoleExporter, however
> I think Jaeger was a good choice in this case - Melodie's choice :). I
> think it will be easier to "reason" about some metrics when we see
> them in the Web UI, and since Jaeger has a standalone containerized
> instance (and we already integrate with it) it might give some extra
> insights and basic graphs that will make it much easier to see if
> things are working as we expect (for example, it nicely plots the
> execution times of the flask methods on a graph, which already tells
> us some basic info about them).
>
> Part of the POC work was to do some demos to show what information we
> can get, and I have a feeling that doing them with a Web UI
> (especially since we already integrated with it) will be much more
> powerful and demoable.
>
> Also, what is important - we do not start from "scratch". We have the
> breeze development environment that is based on docker-compose, and we
> will also be able (long term) to integrate it "nicely" with the
> `--integration jaeger` flag - similarly to what we do with some other
> integrations (kerberos, mongo etc.). We have not done it for now while
> we try to establish a "simple" solution, but once we do, we will try
> to do it in such a way that open-telemetry will be a "pluggable"
> component of Airflow - both in production and for development
> purposes. So this also gives us a chance to see how we will be able to
> make it easy for Airflow developers to add open-telemetry integrations
> and (important!) test them easily.
>
> The end goal for that in the POC is to be able to run `./breeze
> start-airflow --integration jaeger` to start airflow and jaeger and
> have the OT integration enabled (and disabled when jaeger is not
> started as an integration). We have a very similar approach for
> kerberos. When we run `./breeze start-airflow --integration kerberos`,
> airflow starts with the kerberos integration enabled and kerberos
> starts as a separate image via the docker-compose integration. So this
> fits very well into the overall "development" environment of Airflow.
>
> Howard,
>
> I am not sure if you know, but we already have a project outlined
> here, so we know at a high level what we want to achieve over the next
> 2 months: we have a few tasks already hashed out in detail, while some
> of them are drafted as just notes for now:
> https://github.com/apache/airflow/projects/14
> I think one of the things that you can help us with is to scope out
> and add some details to the cards/issues. I thought the most relevant
> are these:
>
> * https://github.com/apache/airflow/projects/14#card-74068217 ("Expand
> the POC with Adding Airflow Metrics")
> * https://github.com/apache/airflow/projects/14#card-74068317 ("POC of
> Monitoring dashboard visualizing the metrics")
> * https://github.com/apache/airflow/projects/14#card-75801333 ("Expand
> POC with Open Telemetry for logging integration.")
>
> But I am happy to discuss it offline here, if you have other ideas :)
>
> This week I am traveling quite a lot (mostly private errands) and it
> will be difficult for me to arrange something (and Elad is also on
> vacation), but I think the following week we could think about making
> a more public demo of the integration (Melodie - what do you think? :)
> This will give Melodie a chance to try other "standard"
> instrumentations, and we will likely be able to see more of what
> "open-telemetry" can give us out of the box; during the demo/meeting
> we could also discuss the scope and ideas for the follow-up parts.
>
> J.
>
> On Mon, Jan 10, 2022 at 1:20 AM <nick@shook.family> wrote:
> >
> > This sounds great. I left a small comment about the console span
> processor. While I think Jaeger is a great choice for production
> dashboards, just printing the spans to STDOUT from the airflow server
> would be a great POC imo, b/c it starts the discussion toward
> structured logging. By having a discussion on structured logging
> first, everything down the line (dashboards, metrics, SLOs etc.) will
> be much easier.
> >
> > fwiw, I think I’d like to see log ingestion services parse structured
> data and create the dashboards on my behalf from just this. I don’t think
> as an end-user of tools like Datadog I’d be asking for too much.
> >
> > if there’s a working session on this, I'd love to join - thanks all
> for this awesome project!
> > ---
> > nick
> >
> > On Jan 9, 2022, at 08:35, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > Good news - I managed to debug and fix/workaround the flask auto
> > instrumentation to work and Melodie should be unblocked.
> >
> > It was not an easy one to pull off - it required a bit of knowledge
> > of how the airflow webserver works under the hood, and finding out
> > that gunicorn's fork model needs a workaround for the open-telemetry
> > integration.
> >
> > This makes our open-telemetry instrumentation a bit more "complex"
> > (but only a bit) and slightly more "exporter-specific" - currently
> > we have a hard-coded Jaeger Exporter - but in the future we should
> > be able to get it better automated - we might even not need any
> > workarounds once this one:
> >
> https://github.com/open-telemetry/opentelemetry-python-contrib/issues/171
> > (Add integration for Gunicorn) is implemented in the
> > open-telemetry-python library (maybe we can even contribute it).
> >
> > You can see the changes I had to implement here:
> >
> https://github.com/apache/airflow/pull/20677/commits/31f5fce3ff456dfffcd5814db9504094a0448ded
> > and see the comment here for the screenshots from Jaeger:
> > https://github.com/apache/airflow/pull/20677#issuecomment-1008327884
> >
> > We are going to proceed now with further integration (hopefully with
> > less trouble) of other existing instrumentations.
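For anyone curious why the fork model bites here, a minimal stdlib-only Python sketch (names are illustrative, not Airflow's or OpenTelemetry's real API): a background "exporter" thread started in the master process does not survive into a forked worker, which is why the tracer provider has to be (re)initialized per worker, e.g. in a gunicorn post_fork hook:

```python
import os
import threading
import time

def start_exporter():
    # Stand-in for a span exporter's background thread (illustrative only,
    # not OpenTelemetry's actual BatchSpanProcessor machinery).
    t = threading.Thread(target=lambda: time.sleep(60),
                         daemon=True, name="exporter")
    t.start()

start_exporter()  # initialized in the "master" process, before forking

r, w = os.pipe()
pid = os.fork()
if pid == 0:
    # "worker" process: after fork() only the calling thread exists, so the
    # exporter thread is gone and spans would silently never be shipped.
    has_exporter = any(t.name == "exporter" for t in threading.enumerate())
    os.write(w, b"1" if has_exporter else b"0")
    os._exit(0)

os.waitpid(pid, 0)
worker_inherited_exporter = os.read(r, 1) == b"1"  # False
print("worker inherited exporter thread:", worker_inherited_exporter)
```

This is the same reason the workaround in the PR re-creates the provider after gunicorn forks rather than in the master.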
> >
> > Howard, Nick,
> >
> > I think what might be helpful (and Howard's product manager view
> > might be super-helpful) is to define the scope of the integration of
> > the "Airflow-specific" telemetry: defining the metrics that we would
> > like to have (starting from the current set of metrics) and later
> > proposing some ways to test them and to produce some basic
> > dashboards with some of the monitoring tools that we could choose.
> > All at a "Proof-of-Concept" level, so that we can produce some real
> > examples and screenshots of how the open-telemetry integration might
> > work and what value it might bring.
> >
> > The end goal of Melody's internship is to prepare an Airflow
> > Improvement Proposal where - based on our learnings from the
> > internship - we propose how the integration would look.
> >
> > WDYT?
> >
> > J.
> >
> > On Sat, Jan 8, 2022 at 12:20 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> >
> > Yep. Absolutely. We are at the stage now (and this is something I
> > have planned to look at this weekend) where we need to see why
> > auto-instrumentation of open-telemetry in Melody's PR does not seem
> > to auto-instrument our Flask integration (we chose flask as the
> > first integration that should be "easy", but for whatever reason
> > auto-instrumentation - even in the `--debug` mode of airflow - does
> > not seem to work despite everything seemingly being "correct").
> >
> > I plan to take a look at it today and we can discuss it in Melody's
> > PR. It would be fantastic if we could work on it together :).
> >
> > J.
> >
> > On Sat, Jan 8, 2022 at 12:09 PM melodie ezeani <fluxi...@gmail.com>
> wrote:
> >
> >
> > Hi Nick,
> >
> > You can look at the PR, or clone my fork and try running it in your
> local environment, and see if there's any way we can improve on the
> auto-instrumentation.
> > Would love to get feedback.
> > Thank you
> >
> > On Sat, 8 Jan 2022 at 12:19 AM, <nick@shook.family> wrote:
> >
> >
> > hi all, been lurking for a while - this is my first post.
> >
> > what I like about open telemetry is that you can send all telemetry
> traces to STDOUT (or any logs), which you can then pipe to the log
> forwarder of your choice. imo this is the easiest way to set it up and
> a default that should work in the vast majority of airflow use cases.
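To illustrate the idea (a hand-rolled sketch, not OpenTelemetry's actual ConsoleSpanExporter or its wire format): each finished span is printed as one JSON object per line on STDOUT, which any log forwarder can then parse as structured data:

```python
import json
import sys
import time

def emit_span(name, attributes):
    # Print one structured "span" per line to STDOUT; a log forwarder can
    # then parse each line as JSON instead of scraping free-form text.
    span = {
        "name": name,
        "timestamp": time.time(),
        "attributes": attributes,
    }
    sys.stdout.write(json.dumps(span, sort_keys=True) + "\n")
    return span

span = emit_span("GET /health", {"http.status_code": 200})
```

The point being: once spans are structured lines on STDOUT, dashboards and alerting become a downstream parsing concern rather than an Airflow concern.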
> >
> > the PR looks like a great start! what can I do to help?
> > ---
> > nick
> >
> > On Jan 7, 2022, at 14:37, Elad Kalif <elad...@apache.org> wrote:
> >
> > Hi Howard,
> >
> > We actually have an Outreachy intern (Melodie) who is working on
> researching how open-telemetry can be integrated with Airflow.
> > Draft PR for the demo: https://github.com/apache/airflow/pull/20677
> > This is an initial effort towards a POC.
> > Maybe you can work together on this?
> >
> >
> > On Sat, Jan 8, 2022 at 12:19 AM Howard Yoo 
> > <howard....@astronomer.io.invalid>
> wrote:
> >
> >
> > Hi all,
> >
> > I’m a staff product manager at Astronomer, and wanted to post this
> email according to the guide from
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals
> .
> >
> > Currently, the main method to publish telemetry data out of airflow
> is through its statsD implementation:
> https://github.com/apache/airflow/blob/main/airflow/stats.py , and
> currently airflow supports two flavors of statsd: the original one and
> Datadog's dogstatsd implementation.
> >
> > Through this implementation, we have the following list of metrics that
> would be available for other popular monitoring tools to collect, monitor,
> visualize, and alert on metrics generated from airflow:
> https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
> >
> > There are a number of limitations to airflow's current
> implementation of its metrics using statsd.
> > 1. StatsD is based on a simple metrics format that does not support
> richer contexts. Its metric names contain some of that context (such
> as dag id, task id, etc.), but this is limited because the context has
> to be formatted into the metric name itself. A better approach would
> be to utilize 'tags' attached to the metrics data to add more context.
> > 2. StatsD also utilizes UDP as its main network protocol, but UDP is
> simple and does not guarantee reliable transmission of the payload.
> Moreover, many monitoring tools are moving to more modern protocols
> such as HTTPS to send out metrics.
> > 3. StatsD supports 'counter', 'gauge', and 'timer', but does not
> support distributed traces or log ingestion.
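To make limitation 1 concrete (the metric and attribute names below are illustrative, not Airflow's actual metric names): with StatsD the context is fused into the metric name, so every dag/task pair yields a distinct metric name, while a tag-based model keeps one stable name and carries the context separately:

```python
# StatsD-style: dag and task identifiers baked into the metric name itself,
# so tooling must parse names to group or filter by dag/task.
statsd_name = "airflow.dag.{dag_id}.{task_id}.duration".format(
    dag_id="my_dag", task_id="extract"
)

# Tag/attribute-style (as in OpenTelemetry): one stable metric name, with
# the context attached as key/value attributes the backend can query on.
otel_metric = {
    "name": "airflow.task.duration",
    "value_ms": 123.4,
    "attributes": {"dag_id": "my_dag", "task_id": "extract"},
}

print(statsd_name)  # airflow.dag.my_dag.extract.duration
print(otel_metric["name"], otel_metric["attributes"])
```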
> >
> > Due to the above reasons, I have been looking at opentelemetry (
> https://github.com/open-telemetry) as a potential replacement for
> airflow's current telemetry instrumentation. Opentelemetry is the
> merger of OpenTracing and OpenCensus, and is quickly gaining momentum
> as a 'standard' means of producing and delivering telemetry data - not
> only metrics, but distributed traces as well as logs. The technology
> is also geared towards better monitoring of cloud-native software.
> Many monitoring tool vendors support opentelemetry (Tanzu, Datadog,
> Honeycomb, Lightstep, etc.) and opentelemetry's modular architecture
> is designed to be compatible with existing legacy instrumentations.
> There are also stable Python SDKs and APIs to implement it in airflow
> easily.
> >
> > Therefore, I’d like to work on a proposal for improving the metrics
> and telemetry capability of airflow by adding configuration and
> support for open telemetry, so that while maintaining backward
> compatibility with the existing statsd-based metrics, we would also
> have the opportunity to base distributed traces and logs on it, making
> it easier for any Opentelemetry-compatible tool to monitor airflow
> with richer information.
> >
> > If you have been thinking of a need to improve the current metrics
> capabilities of airflow, or have been considering standards like
> Opentelemetry, please feel free to join the thread and provide any
> opinions or feedback. I also generally think that we may need to
> review our current list of metrics and assess whether they are really
> useful in terms of monitoring and observability of airflow. There are
> things that we might want to add to the metrics, such as more
> executor-related metrics, scheduler-related metrics, as well as
> operator and even DB and XCom-related metrics, to better assess the
> health of airflow and make this information helpful for faster
> troubleshooting and problem resolution.
> >
> > Thanks and regards,
> > Howard
> >
> >
> >
> >
>
