Repository: incubator-airflow

Updated Branches:
  refs/heads/master 2078daca3 -> ebe715c56
[AIRFLOW-1691] Add better Google cloud logging documentation

Closes #2671 from criccomini/fix-log-docs


Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/ebe715c5
Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/ebe715c5
Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/ebe715c5

Branch: refs/heads/master
Commit: ebe715c565ad9206c9db6a496a1f97326d5baf8a
Parents: 2078dac
Author: Chris Riccomini <[email protected]>
Authored: Mon Oct 9 10:32:34 2017 -0700
Committer: Chris Riccomini <[email protected]>
Committed: Mon Oct 9 10:32:34 2017 -0700

----------------------------------------------------------------------
 UPDATING.md          |  6 ++--
 docs/integration.rst | 71 +++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 74 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/ebe715c5/UPDATING.md
----------------------------------------------------------------------
diff --git a/UPDATING.md b/UPDATING.md
index 329f416..6a0b8bc 100644
--- a/UPDATING.md
+++ b/UPDATING.md
@@ -129,13 +129,13 @@ The `file_task_handler` logger is more flexible. You can change the default form
 
 #### I'm using S3Log or GCSLogs, what do I do!?
 
-If you are logging to either S3Log or GCSLogs, you will need a custom logging config. The `REMOTE_BASE_LOG_FOLDER` configuration key in your airflow config has been removed, therefore you will need to take the following steps:
+If you are logging to Google cloud storage, please see the [Google cloud platform documentation](https://airflow.incubator.apache.org/integration.html#gcp-google-cloud-platform) for logging instructions.
+
+If you are using S3, the instructions should be largely the same as the Google cloud platform instructions above. You will need a custom logging config. The `REMOTE_BASE_LOG_FOLDER` configuration key in your airflow config has been removed, therefore you will need to take the following steps:
 - Copy the logging configuration from [`airflow/config_templates/airflow_local_settings.py`](https://github.com/apache/incubator-airflow/blob/master/airflow/config_templates/airflow_local_settings.py).
 - Place it in a directory on the Python import path `PYTHONPATH`. If you are using Python 2.7, ensure that `__init__.py` files exist so that it is importable.
 - Update the config by setting the path of `REMOTE_BASE_LOG_FOLDER` explicitly in the config. The `REMOTE_BASE_LOG_FOLDER` key is not used anymore.
 - Set `logging_config_class` to the module path and dict name. For example, if you place `custom_logging_config.py` at the base of your `PYTHONPATH`, you will need to set `logging_config_class = custom_logging_config.LOGGING_CONFIG` in your config.
-
-ELSE you don't need to change anything. If there is no custom config, the airflow config loader will still default to the same config.
 
 ### New Features


http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/ebe715c5/docs/integration.rst
----------------------------------------------------------------------
diff --git a/docs/integration.rst b/docs/integration.rst
index 3b50586..cd6cc68 100644
--- a/docs/integration.rst
+++ b/docs/integration.rst
@@ -184,6 +184,77 @@ Airflow has extensive support for the Google Cloud Platform. But note that most
 Operators are in the contrib section.
 Meaning that they have a *beta* status, meaning that they can have breaking changes between minor releases.
 
+Logging
+''''''''
+
+Airflow can be configured to read and write task logs in Google cloud storage.
+Follow the steps below to enable Google cloud storage logging.
+
+#. Airflow's logging system requires a custom ``.py`` file to be located in the ``PYTHONPATH``, so that it's importable from Airflow. Start by creating a directory to store the config file. ``$AIRFLOW_HOME/config`` is recommended.
+#. Set ``PYTHONPATH=$PYTHONPATH:<AIRFLOW_HOME>/config`` in the Airflow environment. If using Supervisor, you can set this in the ``supervisord.conf`` environment parameter. If not, you can export ``PYTHONPATH`` using your preferred method.
+#. Create empty files called ``$AIRFLOW_HOME/config/log_config.py`` and ``$AIRFLOW_HOME/config/__init__.py``.
+#. Copy the contents of ``airflow/config_templates/airflow_local_settings.py`` into the ``log_config.py`` file created in the step above.
+#. Customize the following portions of the template:
+
+   .. code-block:: python
+
+      # Add this variable to the top of the file. Note the trailing slash.
+      GCS_LOG_FOLDER = 'gs://<bucket where logs should be persisted>/'
+
+      # Rename DEFAULT_LOGGING_CONFIG to LOGGING_CONFIG.
+      LOGGING_CONFIG = ...
+
+      # Add a GCSTaskHandler to the 'handlers' block of the LOGGING_CONFIG variable.
+      'gcs.task': {
+          'class': 'airflow.utils.log.gcs_task_handler.GCSTaskHandler',
+          'formatter': 'airflow.task',
+          'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
+          'gcs_log_folder': GCS_LOG_FOLDER,
+          'filename_template': FILENAME_TEMPLATE,
+      },
+
+      # Update the airflow.task and airflow.task_runner loggers to use 'gcs.task' instead of 'file.task'.
+      'loggers': {
+          'airflow.task': {
+              'handlers': ['gcs.task'],
+              ...
+          },
+          'airflow.task_runner': {
+              'handlers': ['gcs.task'],
+              ...
+          },
+          'airflow': {
+              'handlers': ['console'],
+              ...
+          },
+      }
+
+#. Make sure a Google cloud platform connection hook has been defined in Airflow. The hook should have read and write access to the Google cloud storage bucket defined above in ``GCS_LOG_FOLDER``.
+
+#. Update ``$AIRFLOW_HOME/airflow.cfg`` to contain:
+
+   .. code-block:: ini
+
+      task_log_reader = gcs.task
+      logging_config_class = log_config.LOGGING_CONFIG
+      remote_log_conn_id = <name of the Google cloud platform hook>
+
+#. Restart the Airflow webserver and scheduler, and trigger (or wait for) a new task execution.
+#. Verify that logs are showing up for newly executed tasks in the bucket you've defined.
+#. Verify that the Google cloud storage viewer is working in the UI. Pull up a newly executed task, and verify that you see something like:
+
+   .. code-block:: none
+
+      *** Reading remote log from gs://<bucket where logs should be persisted>/example_bash_operator/run_this_last/2017-10-03T00:00:00/16.log.
+      [2017-10-03 21:57:50,056] {cli.py:377} INFO - Running on host chrisr-00532
+      [2017-10-03 21:57:50,093] {base_task_runner.py:115} INFO - Running: ['bash', '-c', u'airflow run example_bash_operator run_this_last 2017-10-03T00:00:00 --job_id 47 --raw -sd DAGS_FOLDER/example_dags/example_bash_operator.py']
+      [2017-10-03 21:57:51,264] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,263] {__init__.py:45} INFO - Using executor SequentialExecutor
+      [2017-10-03 21:57:51,306] {base_task_runner.py:98} INFO - Subtask: [2017-10-03 21:57:51,306] {models.py:186} INFO - Filling up the DagBag from /airflow/dags/example_dags/example_bash_operator.py
+
+Note the top line that says it's reading from the remote log file.
+
+Please be aware that if you were persisting logs to Google cloud storage using the old-style airflow.cfg configuration method, the old logs will no longer be visible in the Airflow UI, though they'll still exist in Google cloud storage. This is a backwards incompatible change. If you are unhappy with it, you can change the ``FILENAME_TEMPLATE`` to reflect the old-style log filename format.
+
 BigQuery
 ''''''''
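----------------------------------------------------------------------

For reference, the finished ``$AIRFLOW_HOME/config/log_config.py`` that the integration.rst steps above describe might look roughly like the following. This is a minimal sketch, not the template itself: start from your copy of ``airflow/config_templates/airflow_local_settings.py``, since the ``LOG_LEVEL``, ``LOG_FORMAT``, ``BASE_LOG_FOLDER``, and ``FILENAME_TEMPLATE`` values shown here are illustrative stand-ins for whatever your version of the template defines.

.. code-block:: python

    # log_config.py -- sketch only; begin from your copy of
    # airflow/config_templates/airflow_local_settings.py.
    import os

    from airflow import configuration as conf

    # Illustrative stand-ins; the real template derives these from airflow.cfg.
    LOG_LEVEL = conf.get('core', 'logging_level').upper()
    LOG_FORMAT = conf.get('core', 'log_format')
    BASE_LOG_FOLDER = conf.get('core', 'base_log_folder')
    FILENAME_TEMPLATE = '{{ ti.dag_id }}/{{ ti.task_id }}/{{ ts }}/{{ try_number }}.log'

    # Added per the customization step above. Note the trailing slash.
    GCS_LOG_FOLDER = 'gs://<bucket where logs should be persisted>/'

    # Renamed from DEFAULT_LOGGING_CONFIG so that
    # logging_config_class = log_config.LOGGING_CONFIG resolves.
    LOGGING_CONFIG = {
        'version': 1,
        'disable_existing_loggers': False,
        'formatters': {
            'airflow.task': {'format': LOG_FORMAT},
        },
        'handlers': {
            'console': {
                'class': 'logging.StreamHandler',
                'formatter': 'airflow.task',
                'stream': 'ext://sys.stdout',
            },
            # The GCSTaskHandler from the docs above; it writes the task log
            # locally and copies it up to GCS_LOG_FOLDER when the handler closes.
            'gcs.task': {
                'class': 'airflow.utils.log.gcs_task_handler.GCSTaskHandler',
                'formatter': 'airflow.task',
                'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
                'gcs_log_folder': GCS_LOG_FOLDER,
                'filename_template': FILENAME_TEMPLATE,
            },
        },
        'loggers': {
            # Task logs route to the GCS handler, per the docs above.
            'airflow.task': {
                'handlers': ['gcs.task'],
                'level': LOG_LEVEL,
                'propagate': False,
            },
            'airflow.task_runner': {
                'handlers': ['gcs.task'],
                'level': LOG_LEVEL,
                'propagate': True,
            },
            'airflow': {
                'handlers': ['console'],
                'level': LOG_LEVEL,
                'propagate': False,
            },
        },
    }

The ``'gcs.task'`` handler name is the one the ``airflow.cfg`` snippet above refers to via ``task_log_reader = gcs.task``; the exact ``propagate`` flags and formatter wiring are worth checking against the template for your Airflow version.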

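For the S3 case that UPDATING.md says is largely the same, a hypothetical ``custom_logging_config.py`` would differ from the sketch above only in the handler block and the handler name the loggers point at. The ``S3TaskHandler`` class path and parameters shown here mirror the GCS ones and should be verified against your Airflow version; this fragment continues from the sketch above.

.. code-block:: python

    # Hypothetical S3 counterpart to GCS_LOG_FOLDER; note the trailing slash.
    S3_LOG_FOLDER = 's3://<bucket where logs should be persisted>/'

    # Replaces the 'gcs.task' handler, with airflow.cfg getting
    # task_log_reader = s3.task and remote_log_conn_id naming an S3 connection.
    LOGGING_CONFIG['handlers']['s3.task'] = {
        'class': 'airflow.utils.log.s3_task_handler.S3TaskHandler',
        'formatter': 'airflow.task',
        'base_log_folder': os.path.expanduser(BASE_LOG_FOLDER),
        's3_log_folder': S3_LOG_FOLDER,
        'filename_template': FILENAME_TEMPLATE,
    }
    # Point the task loggers at the S3 handler instead of the GCS one.
    LOGGING_CONFIG['loggers']['airflow.task']['handlers'] = ['s3.task']
    LOGGING_CONFIG['loggers']['airflow.task_runner']['handlers'] = ['s3.task']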