ArshiAAkhavan opened a new issue #17678:
URL: https://github.com/apache/airflow/issues/17678
**Description**
There are three main ways to store logs when running Airflow on `k8s`: a shared `persistentVolume`, `s3` object storage, and `elasticsearch`.
Leaving the shared volume aside, the other two are not implemented nicely when you use the `kubernetesExecutor` as your executor.
The current flow (I have tested the Elasticsearch path, so I will describe it with `elasticsearch`):
- For each task, the `kubernetesExecutor` creates a pod that acts as the hosting worker node.
- Task logs are persisted either on the pod's local disk or written to the container's log stream via `elasticsearch.write_json=true` in the `airflow.cfg` file (see the config sketch below).
- From there, a log aggregation service (in my case `filebeat` + `logstash`) reads the log files and ships them to Elasticsearch.
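For context, here is roughly the `airflow.cfg` block involved (a minimal sketch; option names can differ between Airflow versions, in the ones I checked they are spelled `write_stdout` / `json_format`, and the host value is a placeholder):

```ini
[logging]
# enable the remote task log handler
remote_logging = True

[elasticsearch]
# placeholder host, point this at your own Elasticsearch cluster
host = elasticsearch.example.com:9200
# format task logs as JSON lines so the log shipper can parse them
json_format = True
# write task logs to the worker's stdout instead of local files
write_stdout = True
```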
The flow above can be improved, because as of now:

- either the logs are persisted on the container's local disk, so you need a `filebeat`/`logstash` container next to each worker container, with a shared volume mounted on the log directory,
- or you use the `elasticsearch.write_json=true` config, which lets a `daemonset` of `filebeat` (responsible for log aggregation on each node) read the log files generated by `k8s` (usually under `/var/log/containers/*.log`).

Both of these solutions work, but both of them waste resources.
Meanwhile, the `kubernetesPodOperator` already does the fascinating job of retrieving logs from the task pods' stdout via the Kubernetes API by default! I was thinking that by combining these two features (`elasticsearch.write_stdout=true` and the `kubernetesPodOperator`'s default behaviour), we could send logs from the worker pods directly to the scheduler and have them stored on the scheduler pod instead (see the sketch below).
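To illustrate the direction I mean (only a sketch, not a proposed implementation): a scheduler-side component could tail a worker pod's stdout through the official `kubernetes` Python client, much like the `kubernetesPodOperator` does for its own pods. The function name, the namespace default, and the `print` placeholder are made up for the example:

```python
# a minimal sketch, assuming the official `kubernetes` Python client is installed
# and the scheduler pod has RBAC permission to read pod logs
from kubernetes import client, config


def tail_worker_pod_logs(pod_name: str, namespace: str = "airflow") -> None:
    """Stream a worker pod's stdout line by line, similar to what the
    kubernetesPodOperator already does for its own pods."""
    config.load_incluster_config()  # or config.load_kube_config() outside the cluster
    core_v1 = client.CoreV1Api()

    # follow=True keeps the connection open; _preload_content=False gives a
    # streaming response we can iterate over line by line
    response = core_v1.read_namespaced_pod_log(
        name=pod_name,
        namespace=namespace,
        follow=True,
        _preload_content=False,
    )
    for raw_line in response:
        line = raw_line.decode("utf-8").rstrip()
        # placeholder: a real implementation would hand this line to the
        # scheduler's task log handler instead of printing it
        print(f"[{pod_name}] {line}")


# hypothetical usage; the pod name would come from the executor's bookkeeping
# tail_worker_pod_logs("example-dag-task-abc123")
```

In a real implementation the streamed lines would go to Airflow's task log handler rather than being printed.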
**Use case / motivation**
First of all, if you deploy your scheduler and webserver outside of your `k8s` cluster, that is currently the end of the road for you, since the setup relies on the logs being stored on a disk visible to both the webserver and the scheduler.

If you deploy the scheduler and webserver on `k8s` (which is common practice), you still need the `logstash`/`filebeat` service to send logs to your `elasticsearch` instance, but with this change you would no longer need a whole `daemonset` or one instance per worker pod; one per scheduler pod would suffice, which is much less resource usage (in my case I have only one scheduler pod, so it is just one!).
**What do you want to happen?**
The whole process of remote logging to `elasticsearch` is much harder than other parts of deploying Airflow with the `kubernetesExecutor`, and I am trying to ease that process.
It also feels like the more k8s-ish way to do it!
**Are you willing to submit a PR?**
Yes, if pointed in the right direction of where to look!