ArshiAAkhavan opened a new issue #17678:
URL: https://github.com/apache/airflow/issues/17678


**Description**
There are three main ways to store logs when running Airflow on Kubernetes: a shared `PersistentVolume`, S3 object storage, and Elasticsearch.
Putting the shared volume aside, the other two are not implemented nicely when using the `KubernetesExecutor` as your executor.
The current flow (I have tested the Elasticsearch path, so I describe it here):
- for each task, the `KubernetesExecutor` creates a pod that acts as the hosting worker node
- task logs are either persisted on the pod's local disk or written to the container log via `elasticsearch.write_json=true` in the `airflow.cfg` file (see the config sketch after this list)
- from there, a log-aggregation service (in my case Filebeat + Logstash) reads the log files and ships them to Elasticsearch
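
For reference, this is roughly the `airflow.cfg` shape I mean; the option names below (`write_stdout`, `json_format`) are taken from the Airflow Elasticsearch logging docs, so adjust them to your version:

```ini
# Rough sketch of the relevant airflow.cfg sections for Elasticsearch task logging;
# exact option names and defaults may differ between Airflow versions.
[logging]
remote_logging = True

[elasticsearch]
# placeholder host, replace with your Elasticsearch endpoint
host = elasticsearch.example.internal:9200
# write task logs as JSON to the container's stdout so Kubernetes captures them
# under /var/log/containers/*.log, where a log shipper can pick them up
write_stdout = True
json_format = True
```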
   
The setup above can be improved, because as of now:
- either you have logs persisted on the container's local disk, so you need a Filebeat/Logstash container next to each worker container with a shared volume mounted on the log directory,
- or you have used the `elasticsearch.write_json=true` config, which lets you read the log files generated by Kubernetes (usually under `/var/log/containers/*.log`) via a Filebeat `DaemonSet` that handles log aggregation on each node.
   
Although both of these solutions work, both of them waste resources.
   
Meanwhile, we have the `KubernetesPodOperator`, which already does the fascinating job of retrieving logs from the task pod's stdout via the Kubernetes API by default!

I was thinking that by combining these two features (`elasticsearch.write_stdout=true` and the `KubernetesPodOperator`'s default behaviour), we could send logs from the worker pods directly to the scheduler and have them stored on the scheduler pod instead.
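
To make the idea concrete, here is a minimal sketch, assuming the official `kubernetes` Python client (the namespace and container name are placeholders), of the kind of log streaming the `KubernetesPodOperator` already does and that the executor/scheduler side could reuse for task logs:

```python
# Minimal sketch: stream a worker pod's stdout through the Kubernetes API.
# This is NOT the actual KubernetesPodOperator code, just an illustration of
# the mechanism; the namespace and container name are placeholders.
from kubernetes import client, config


def stream_pod_logs(pod_name: str, namespace: str = "airflow") -> None:
    # Use in-cluster config when running inside k8s, fall back to kubeconfig.
    try:
        config.load_incluster_config()
    except config.ConfigException:
        config.load_kube_config()

    core_v1 = client.CoreV1Api()
    # follow=True keeps the connection open while the task pod runs;
    # _preload_content=False returns a raw response we can stream from.
    resp = core_v1.read_namespaced_pod_log(
        name=pod_name,
        namespace=namespace,
        container="base",  # placeholder: the worker container name
        follow=True,
        _preload_content=False,
    )
    for chunk in resp.stream():
        # The scheduler side could feed these chunks into the task's log
        # handler instead of printing them.
        print(chunk.decode("utf-8", errors="replace"), end="")
```

With `elasticsearch.write_stdout=true` and JSON formatting enabled, what comes over this stream is already the structured record that today has to be scraped from `/var/log/containers/` by a Filebeat `DaemonSet` or sidecar.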
    
**Use case / motivation**
First of all, if you deploy your scheduler and webserver outside of your Kubernetes cluster, that's the end of the road for you, since your logs are stored on a disk that has to be visible to both the webserver and the scheduler.
   
If you are deploying your scheduler and webserver on Kubernetes (which is common practice), then you still need the Logstash/Filebeat service to ship logs to your Elasticsearch instance, but you no longer need a whole `DaemonSet` or one instance per worker pod; one per scheduler pod would suffice, which is much less resource usage (in my case I have only one scheduler pod, so it's just one!).
   
**What do you want to happen?**
The whole process of remote logging to Elasticsearch is hard compared to the other parts of deploying Airflow with the `KubernetesExecutor`, and I am trying to ease that process.

Also, I feel it is the more Kubernetes-native way to do it!
   
**Are you willing to submit a PR?**
If pointed in the right direction on where to look, yes!
   
   

