Hi Howard,

We actually have an Outreachy intern (Melodie) who is working on
researching how OpenTelemetry can be integrated with Airflow.
Draft PR for the demo: https://github.com/apache/airflow/pull/20677
This is an initial effort for a POC.
Maybe you can work together on this?


On Sat, Jan 8, 2022 at 12:19 AM Howard Yoo <howard....@astronomer.io.invalid>
wrote:

> Hi all,
>
> I’m a staff product manager at Astronomer, and wanted to post this email
> following the AIP guide:
> https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals
>
> Currently, the main method to publish telemetry data out of Airflow is
> through its StatsD implementation:
> https://github.com/apache/airflow/blob/main/airflow/stats.py
> Airflow supports two flavors of StatsD: the original implementation and
> Datadog’s DogStatsD.
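
For anyone unfamiliar with the mechanism, here is a rough sketch of what a
StatsD client does under the hood (the metric name is illustrative, and the
real client is the statsd library rather than hand-rolled sockets like this):

```python
import socket

def send_counter(metric_name, value=1, host="127.0.0.1", port=8125):
    # Classic StatsD line protocol: "<name>:<value>|<type>", "c" = counter.
    payload = f"{metric_name}:{value}|c".encode("ascii")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Fire-and-forget: UDP gives no acknowledgement that anything received it.
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload

send_counter("airflow.scheduler_heartbeat")
```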
>
> Through this implementation, Airflow exposes the following list of metrics,
> which popular monitoring tools can collect, monitor, visualize, and alert
> on:
> https://airflow.apache.org/docs/apache-airflow/stable/logging-monitoring/metrics.html
>
>
> There are a number of limitations in Airflow’s current implementation of
> its metrics using StatsD.
> 1. StatsD is based on a simple metric format that does not support richer
> context. Some of that context (such as the DAG id and task id) ends up
> embedded in the metric name itself, which is limiting because everything
> has to fit into the naming format. A better approach would be to attach
> ‘tags’ to the metric data to carry additional context.
> 2. StatsD also uses UDP as its main network protocol, but UDP is a simple
> protocol that does not guarantee reliable delivery of the payload.
> Moreover, many monitoring systems are moving to more modern transports,
> such as HTTPS, for sending metrics.
> 3. StatsD supports the ‘counter,’ ‘gauge,’ and ‘timer’ types, but it does
> not support distributed traces or log ingestion.
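
To make point 1 concrete, here is a small sketch (hypothetical metric names,
not Airflow's exact naming scheme) contrasting plain StatsD, where context is
baked into the metric name, with a DogStatsD-style tagged line, where the
measurement keeps one stable name and context travels as tags:

```python
def plain_statsd_counter(dag_id, task_id, value=1):
    # Plain StatsD: dag_id and task_id must be embedded in the name itself,
    # so every (dag, task) pair produces a distinct metric name.
    return f"ti.finish.{dag_id}.{task_id}.success:{value}|c"

def tagged_counter(dag_id, task_id, value=1):
    # DogStatsD-style: one stable metric name; context appended as tags
    # after the '#' marker, where tools can filter and group on it.
    return f"ti.finish.success:{value}|c|#dag_id:{dag_id},task_id:{task_id}"

print(plain_statsd_counter("my_dag", "extract"))
print(tagged_counter("my_dag", "extract"))
```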
>
> Due to the above reasons, I have been looking at OpenTelemetry (
> https://github.com/open-telemetry ) as a potential replacement for
> Airflow’s current telemetry instrumentation. OpenTelemetry is the merger
> of OpenTracing and OpenCensus, and is quickly gaining momentum as a
> standard way of producing and delivering telemetry data: not only
> metrics, but distributed traces and logs as well. The technology is also
> geared towards monitoring cloud-native software. Many monitoring vendors
> support OpenTelemetry (Tanzu, Datadog, Honeycomb, Lightstep, etc.), and
> its modular architecture is designed to be compatible with existing
> legacy instrumentation. There are also stable Python SDKs and APIs that
> make it straightforward to adopt in Airflow.
>
> Therefore, I’d like to propose improving Airflow’s metrics and telemetry
> capability by adding configuration and support for OpenTelemetry. While
> maintaining backward compatibility with the existing StatsD-based
> metrics, we would also gain the option to base distributed traces and
> logs on it, so that any OpenTelemetry-compatible tool can monitor Airflow
> with richer information.
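
The backward-compatible direction described above could look roughly like
this (all names here are hypothetical and not Airflow's actual `stats.py`
API): a facade fans each metric call out to every configured backend, so the
legacy StatsD path keeps working while a tag-aware backend runs alongside it:

```python
class LegacyStatsdBackend:
    """Emits classic StatsD lines; has no notion of tags."""
    def __init__(self):
        self.lines = []

    def incr(self, name, tags=None):
        # Tags are silently dropped: StatsD can only carry the name.
        self.lines.append(f"{name}:1|c")


class OtelStyleBackend:
    """Keeps context as attributes, the way OpenTelemetry metrics do."""
    def __init__(self):
        self.points = []

    def incr(self, name, tags=None):
        self.points.append((name, dict(tags or {})))


class Stats:
    """Facade: one call site, any number of backends."""
    def __init__(self, *backends):
        self.backends = backends

    def incr(self, name, tags=None):
        for backend in self.backends:
            backend.incr(name, tags)


statsd = LegacyStatsdBackend()
otel = OtelStyleBackend()
stats = Stats(statsd, otel)
stats.incr("ti.finish.success", tags={"dag_id": "my_dag", "task_id": "extract"})
```

Existing call sites would keep emitting StatsD unchanged, while operators
enabling OpenTelemetry would get the same events with full context attached.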
>
> If you have been thinking that Airflow’s current metrics capabilities
> need improvement, and have been considering standards like OpenTelemetry,
> please feel free to join the thread and share opinions or feedback. I
> also think we should review the current list of metrics and assess
> whether they are really useful for monitoring and observability of
> Airflow. There are metrics we might want to add, such as more
> executor-related and scheduler-related metrics, as well as operator, DB,
> and XCom-related metrics, to better assess the health of Airflow and make
> this information helpful for faster troubleshooting and problem
> resolution.
>
> Thanks and regards,
> Howard
>
