[
https://issues.apache.org/jira/browse/AIRFLOW-3370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16715356#comment-16715356
]
ASF GitHub Bot commented on AIRFLOW-3370:
-----------------------------------------
rhwang10 opened a new pull request #4303: [AIRFLOW-3370] Elasticsearch log task
handler additional features
URL: https://github.com/apache/incubator-airflow/pull/4303
Make sure you have checked _all_ steps below.
### Jira
- [x] My PR addresses the following
[Airflow-3370](https://issues.apache.org/jira/browse/AIRFLOW-3370)
### Description
Add additional options and documentation for using the
ElasticsearchTaskHandler.
The easiest and most foolproof way to implement a logging solution is to
write to the standard output streams (stdout and stderr). In the Kubernetes
ecosystem, where pods are killed and restarted constantly, any solution that
instead writes logs to the local filesystem requires persistent storage to
preserve log history.
For the webserver and scheduler components of Airflow, logging to standard
output streams is built in. However, when tasks are executed in Celery, workers
will fork off child processes to execute tasks concurrently. Before the child
process ends, it makes a call to the Airflow task logger, and a task log file
is written to the file system.
This causes several potential problems when running Airflow on top of
Kubernetes. Given that Airflow produces a constant stream of log output,
running an Airflow environment using Celery in a Kubernetes cluster requires
large amounts of memory resources. When those memory resources are exceeded,
either 1) worker pods are evicted by Kubernetes, or 2) worker output stalls
and tasks pile up without completing. The current best-case workaround is to
run a sidecar container that tails the task log files on the worker node's
filesystem and writes them to its own standard output. However, this becomes
a scaling issue when running multiple deployments, and is unsustainable for
massive Airflow deployments.
The options that are added in this PR look to circumvent the need for
persistent volumes in worker nodes when running Airflow on Kubernetes. Workers
will no longer need to be Stateful Sets and can instead be Deployments.
The `elasticsearch_write_stdout` flag in `airflow.cfg` allows the child
process to write its log output to the parent process's standard output
stream. The `elasticsearch_json_format` flag in `airflow.cfg` enables optional
JSON formatting of task instance logs based on the `logging` module's
LogRecord attributes. It must be used in conjunction with the
`elasticsearch_record_labels` configuration.
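A rough sketch of how these options might look in `airflow.cfg`; the section
placement and the record-label values shown here are illustrative, not taken
from the final merged configuration:

```ini
[elasticsearch]
# Illustrative only -- option placement may differ in the merged config.
# Write task logs to the worker's stdout instead of task log files.
elasticsearch_write_stdout = True
# Emit each log line as JSON built from logging.LogRecord attributes.
elasticsearch_json_format = True
# Required when json_format is enabled: the LogRecord attributes to emit.
elasticsearch_record_labels = asctime, filename, lineno, levelname, message
```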
A potential use case for these options is setting up a log monitoring stack,
such as EFK (Elasticsearch, Fluentd, Kibana), without requiring persistent
volumes. A Fluentd daemon listening on every node awaits log output on the
standard output stream. With the `write_stdout` flag set, it can capture the
log output of tasks executed in child processes via the parent process's
standard output. With the `json_format` flag set, Fluentd can be configured to
filter and select log records, without needing to parse the standard Airflow
log format, before sending them to a destination such as Elasticsearch, where
they can be stored and handled independently of Airflow on Kubernetes.
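To make the JSON-formatting idea concrete, here is a minimal sketch (not the
actual Airflow handler) of a `logging.Formatter` that renders a configurable
set of LogRecord attributes as one JSON document per line on stdout, the shape
a collector such as Fluentd can ingest without a custom parser. The class and
function names, and the default attribute list, are illustrative:

```python
import json
import logging
import sys


class JsonStdoutFormatter(logging.Formatter):
    """Render selected LogRecord attributes as a single JSON line."""

    # Analogous in spirit to elasticsearch_record_labels: which
    # LogRecord attributes to include in each JSON document.
    record_labels = ("asctime", "filename", "lineno", "levelname", "message")

    def format(self, record):
        # Populate the derived attributes that plain format strings
        # normally fill in for us.
        record.message = record.getMessage()
        record.asctime = self.formatTime(record)
        return json.dumps({label: getattr(record, label)
                           for label in self.record_labels})


def make_stdout_logger(name="airflow.task.sketch"):
    """Attach the JSON formatter to a stdout handler on the named logger."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonStdoutFormatter())
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

Because each line is self-describing JSON, a Fluentd `json` parser on the
container's stdout stream is enough to route records onward to Elasticsearch.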
If either of these options is unset or NULL, `es_task_handler.py` will
function exactly as before, with read-only functionality. The options in this
PR simply provide more functionality and choices to users.
### Tests
- [x] My PR adds the following unit tests __OR__ does not need testing for
this extremely good reason:
#### handler close
- test_close_stdout_logs
- test_close_closed_stdout_logs
- test_close_no_mark_end_stdout_logs
- test_close_with_no_handler_stdout_logs
- test_close_with_no_stream_stdout_logs
#### handler read
- test_read_stdout_logs
- test_read_nonexistent_log_stdout_logs
- test_read_raises_stdout_logs_json_true
- test_read_timeout_stdout_logs
- test_read_with_empty_metadata_stdout_logs
- test_read_with_none_metadata_stdout_logs
#### handler render_log_id and set_context
- test_render_log_id_json_true
- test_set_context_stdout_logs
### Commits
- [x] My commits all reference Jira issues in their subject lines, and I
have squashed multiple commits if they address the same issue. In addition, my
commits follow the guidelines from "[How to write a good git commit
message](http://chris.beams.io/posts/git-commit/)":
1. Subject is separated from body by a blank line
1. Subject is limited to 50 characters (not including Jira issue reference)
1. Subject does not end with a period
1. Subject uses the imperative mood ("add", not "adding")
1. Body wraps at 72 characters
1. Body explains "what" and "why", not "how"
### Documentation
- [x] In case of new functionality, my PR adds documentation that describes
how to use it.
### Code Quality
- [x] Passes `flake8`
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
> Enhance current ES handler with stdout capability and more output options
> -------------------------------------------------------------------------
>
> Key: AIRFLOW-3370
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3370
> Project: Apache Airflow
> Issue Type: Improvement
> Components: celery, logging, worker
> Affects Versions: 1.10.0
> Reporter: Robert Hwang
> Priority: Major
>
> Currently, the ES handler in 1.10 can only search (read) from ES; it cannot
> write. Two possible enhancements are to give the handler "write to standard
> out" functionality, and to make log messages more remote-storage friendly,
> with an optional JSON-formatted message and the regular pretty-print log
> format as the default.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)