potiuk commented on issue #16706:
URL: https://github.com/apache/airflow/issues/16706#issuecomment-909732961


   Hmm. I spent a bit of time reading. The title suggested that you wanted to focus on "Hooks", and that made me think you wanted to focus on instrumenting particular hooks (each hook separately), but I have a different understanding now - please correct me if I am wrong or if my understanding needs some correction :).
   
   Let me rephrase it all and describe in my own words how I understand what should be done. I might be repeating what you expressed before; I just want to make sure that we understand it the same way.
   
   I think I misunderstood, when first reading your answer, the 'contribute an opentelemetry-instrumentation-X package' part.
   
   Did you mean contributing an `opentelemetry-instrumentation-airflow`?
   Or did you mean just adding dependencies on the existing instrumentation packages in Airflow's setup.py and providing some way of enabling them?
   
   If it's the former - Airflow does not feel like a good candidate for an `opentelemetry-instrumentation-airflow`. Hardly anyone uses Airflow as a library in the sense that some other code would want to instrument it. So I assume you mean the latter, and I work with that assumption below.
   
   1) Integrating particular libraries
   
   I think it would be nice to add the instrumentations for the libraries we use (psycopg2, mysql/pymysql, celery, boto, botocore, flask, sqlalchemy, httpx, urllib3, redis, requests, sqlite3, wsgi) - those could give low-level traceable data that might be really useful to start with. Necessarily with some way of enabling and configuring each of them - for example someone might want to disable tracing for Postgres, because it might hurt performance if instrumentation is enabled for all calls, or maybe selectively enable cursor tracking. Basically we need to figure out, for each of the libraries, what should be configurable and how - likely in airflow.cfg as a new [opentelemetry] section. And we need to figure out how to add those packages (ideally as extras: `pip install apache-airflow[opentelemetry-instrumentation-psycopg2]` should install the psycopg2 instrumentation). Possibly the instrumentation of a particular library should even be enabled by default when the given package is installed (to make things easier), with the ability to disable it selectively. I think those instrumentations should be done one-by-one; there is really no way to make a general dynamic instrumentation (but it's rather easy to do it one-by-one). Extras would be useful to be able to generate a "golden" set of constraints for the different versions of those telemetry libraries supported by Airflow (same as the providers' extras). One more useful thing here would be the ability to disable/enable such tracing selectively WHILE the application is running, but that might be a bit more complex (though long term doable, I believe).
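
   Just to illustrate what I mean by selective enablement via airflow.cfg, here is a minimal sketch - the `[opentelemetry]` options and the helper function are made up for the example, only the instrumentor classes are the real ones from the respective opentelemetry-instrumentation-* packages:

```python
# Hypothetical: read per-library flags from a new [opentelemetry] section in airflow.cfg
# and turn on the corresponding opentelemetry-instrumentation-* package if requested.
from airflow.configuration import conf


def enable_configured_instrumentations():
    if conf.getboolean("opentelemetry", "instrument_psycopg2", fallback=False):
        from opentelemetry.instrumentation.psycopg2 import Psycopg2Instrumentor

        Psycopg2Instrumentor().instrument()
    if conf.getboolean("opentelemetry", "instrument_requests", fallback=False):
        from opentelemetry.instrumentation.requests import RequestsInstrumentor

        RequestsInstrumentor().instrument()
    if conf.getboolean("opentelemetry", "instrument_celery", fallback=False):
        from opentelemetry.instrumentation.celery import CeleryInstrumentor

        CeleryInstrumentor().instrument()
```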
   
   [HERE: Documentation update needed on how to install and enable telemetry for the libraries]
   
   So far, so good. This could give us the low-level data collected. Now, what to do with it?
   
   2) Exporting the data
   
   We need a way to extensibly define exporters. Similarly to how we define Secret Backends, we should be able to define how exporter(s) get registered/instantiated and configured (again in airflow.cfg). Sounds fairly easily doable. Worst case, we can implement thin wrappers around the exporters to map the configuration from some dictionary form to actually constructing and instantiating them. There are some things to solve for the PUSH vs. PULL approach (authentication? How to make sure the ports are exposed in a deployment? How to nicely embed those networking pieces in our Helm Chart?).
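
   To make the "thin wrapper" idea a bit more concrete, a rough sketch of how it could look - the `[opentelemetry]` keys are made up for the example, the import-the-class-from-config pattern is the same one we use for Secret Backends, and the SDK calls are the standard OpenTelemetry Python ones:

```python
# Hypothetical config keys; the exporter class path and its kwargs come from airflow.cfg,
# e.g. span_exporter = opentelemetry.exporter.otlp.proto.grpc.trace_exporter.OTLPSpanExporter
#      span_exporter_kwargs = {"endpoint": "http://otel-collector:4317"}
import json

from airflow.configuration import conf
from airflow.utils.module_loading import import_string
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor


def configure_span_exporter():
    exporter_path = conf.get("opentelemetry", "span_exporter", fallback=None)
    if not exporter_path:
        return
    exporter_kwargs = json.loads(conf.get("opentelemetry", "span_exporter_kwargs", fallback="{}"))
    # Import the configured exporter class and construct it from the config dictionary.
    exporter = import_string(exporter_path)(**exporter_kwargs)
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(exporter))
    trace.set_tracer_provider(provider)
```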
   
   [HERE: Documentation will be needed on how to configure several popular exporters - Prometheus/DataDog etc. etc.]
   
   That sounds like we can already have usable data and the ability to view it at this point.
   
   3) Propagating context and correlation.
   
   From what I read, spans can allow us to correlate events generated by those different libraries. Seems like we could define them for different "parts" of the system. Some spans can be defined just for "executing task", some for the "parsing" and some for the "scheduling" parts. I think all of them need treatment (not only hooks) as those parts are related and can interfere with each other. There will be some difficulties to solve there - we use multi-processing and forking heavily in parts of the system, Celery workers work continuously for different tasks (so different spans) in CeleryExecutor, while the K8S executor always starts a new pod per task etc. etc. Implicit propagation will likely work in some cases, but it might require some effort to make it complete. That will likely be the most complex part of it (but it can also be done in stages). I think we need at most "Task" correlation (all events belonging to a particular task). Not sure if it is possible to add a DAG span on top of the Task as a "hierarchy". Might be useful.
   
   [Here: we need to document the spans and the attributes they have]
   
   Now the really useful part:
   
   4) Metrics 
   
   I understand that each of the instrumented libraries will already provide its metrics (urllib3 will provide "bytes transferred", for one) and we do not have to do anything to get them once we have the spans, and the spans will nicely correlate all the urllib3 calls (so we will immediately be able to see how much data was transferred via urllib3 in task X). That sounds cool, but then we can also look at the metrics we already have which are specific to Airflow, and add them in.
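
   As an example of adding one of our own metrics next to the ones the instrumented libraries emit, a small sketch using the OpenTelemetry metrics API - the metric name and attributes are just an illustration, not a decided naming scheme:

```python
# Emit an Airflow-specific measurement (scheduling delay of a task, as an example)
# through the OpenTelemetry metrics API so it lands next to the library-provided metrics.
from opentelemetry import metrics

meter = metrics.get_meter("airflow")

task_scheduling_delay = meter.create_histogram(
    "airflow.task.scheduling_delay",
    unit="s",
    description="Time between a task being scheduled and it actually starting to run",
)


def record_scheduling_delay(dag_id, task_id, delay_seconds):
    task_scheduling_delay.record(delay_seconds, attributes={"dag_id": dag_id, "task_id": task_id})
```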
   
   Uff. Finally got my thoughts written down...
   
   
   Now, let me summarise what we can get compared to the current statsd/logging/metrics we have:
   
   * by using those instrumentation libraries, we will automatically get low-level tracing for them [we do not have that now]
   
   * by correlating the events from different libraries with the "higher-level events" of Airflow (for example task execution) and the "higher-level metrics" of Airflow (for example delay in task execution), we can have a consistent view of how "each task" behaves (so to speak). [we only have the high-level metrics now - without the low-level correlated data]
   
   * by plugging in the existing exporters, we get an easy way to open up to a number of different tracing solutions - not only Grafana, Prometheus etc., but also AWS CloudWatch, GCP Cloud Trace etc. - those all basically come for free.
   
   Do I get it right? Are we on the same page?
   

