jedcunningham commented on code in PR #34381: URL: https://github.com/apache/airflow/pull/34381#discussion_r1326603097
########## airflow/providers/amazon/aws/config_templates/config.yml: ########## @@ -0,0 +1,131 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +--- + +aws_ecs_executor: + description: | + This section only applies if you are using the AwsEcsExecutor in + Airflow's ``[core]`` configuration. + For more information on any of these execution parameters, see the link below: + https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs/client/run_task.html + For boto3 credential management, see + https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html + options: + conn_id: + description: | + The Airflow connection (i.e. credentials) used by the ECS executor to make API calls to AWS ECS. + version_added: "2.8" + type: string + example: "aws_default" + default: "aws_default" + region: + description: | + The name of the AWS Region where Amazon ECS is configured. Required. + version_added: "2.8" + type: string + example: "us-east-1" + default: ~ + assign_public_ip: + description: | + Whether to assign a public IP address to the containers launched by the ECS executor. + For more info see url to Boto3 docs above. 
+ version_added: "2.8" + type: boolean + example: "True" + default: "False" + cluster: + description: | + Name of the Amazon ECS Cluster. Required. + version_added: "2.8" + type: string + example: "ecs_executor_cluster" + default: ~ + container_name: + description: | + Name of the container that will be used to execute Airflow tasks via the ECS executor. + The container should be specified in the ECS Task Definition and will receive an airflow + CLI command as an additional parameter to its entrypoint. For more info see url to Boto3 + docs above. Required. + version_added: "2.8" + type: string + example: "ecs_executor_container" + default: ~ + launch_type: + description: | + Launch type can either be 'FARGATE' OR 'EC2'. For more info see url to + Boto3 docs above. + + If the launch type is EC2, the executor will attempt to place tasks on + empty EC2 instances. If there are no EC2 instances available, no task + is placed and this function will be called again in the next heartbeat. + + If the launch type is FARGATE, this will run the tasks on new AWS Fargate + instances. + version_added: "2.8" + type: string + example: "FARGATE" + default: "FARGATE" + platform_version: + description: | + The platform version the task uses. A platform version is only specified + for tasks hosted on Fargate. If one isn't specified, the LATEST platform + version is used. + version_added: "2.8" + type: string + example: "1.4.0" + default: "LATEST" + security_groups: + description: | + The comma-separated IDs of the security groups associated with the task. If you + don't specify a security group, the default security group for the VPC is used. + There's a limit of 5 security groups. For more info see url to Boto3 docs above. + version_added: "2.8" + type: string + example: "sg-XXXX,sg-YYYY" + default: ~ + subnets: + description: | + The comma-separated IDs of the subnets associated with the task or service. + There's a limit of 16 subnets. For more info see url to Boto3 docs above.
+ version_added: "2.8" + type: string + example: "subnet-XXXXXXXX,subnet-YYYYYYYY" + default: ~ + task_definition: + description: | + The family and revision (family:revision) or full ARN of the task definition + to run. If a revision isn't specified, the latest ACTIVE revision is used. + For more info see url to Boto3 docs above. + version_added: "2.8" + type: string + example: executor_task_definition:LATEST + default: ~ + max_run_task_attempts: + description: | + The maximum number of times the ECS executor should attempt to run a task. + version_added: "2.8" + type: int + example: "3" + default: "3" Review Comment: This feels like a dangerous default (and maybe even feature). Don't we want Airflow's retries to control this instead? ########## airflow/providers/amazon/aws/config_templates/config.yml: ########## @@ -0,0 +1,131 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# or more contributor license agreements. See the NOTICE file +# distributed with this work for additional information +# regarding copyright ownership. The ASF licenses this file +# to you under the Apache License, Version 2.0 (the +# "License"); you may not use this file except in compliance +# with the License. You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +--- + +aws_ecs_executor: + description: | + This section only applies if you are using the AwsEcsExecutor in + Airflow's ``[core]`` configuration.
+ For more information on any of these execution parameters, see the link below: + https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ecs/client/run_task.html + For boto3 credential management, see + https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html + options: + conn_id: + description: | + The Airflow connection (i.e. credentials) used by the ECS executor to make API calls to AWS ECS. + version_added: "2.8" Review Comment: This should be the version of the provider this is added to, not the next core minor. ########## airflow/providers/amazon/aws/executors/ecs/Setup_guide.md: ########## @@ -0,0 +1,148 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + --> + +# Setting up an ECS Executor for Apache Airflow + +There are 3 steps involved in getting an ECS Executor to work in Apache Airflow: + +1. Creating a database that Airflow and the Executor can connect to. Review Comment: ```suggestion 1. Creating a database that Airflow and the tasks running in ECS can connect to. 
``` ########## airflow/providers/amazon/aws/executors/ecs/Dockerfile: ########## @@ -0,0 +1,86 @@ +# hadolint ignore=DL3007 +FROM apache/airflow:latest +USER root +RUN apt-get update \ + && apt-get install -y --no-install-recommends unzip \ + # The below helps to keep the image size down + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* +RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" +RUN unzip awscliv2.zip && ./aws/install Review Comment: If you combine these into one step you can toss the zip file so it doesn't sit in a layer. Might also want to toss the expanded files too? ########## airflow/providers/amazon/aws/executors/ecs/README.md: ########## @@ -0,0 +1,196 @@ +<!-- Review Comment: Good doc. It should probably be moved to the providers real docs though, instead of being in the source here. ########## airflow/providers/amazon/aws/executors/ecs/Dockerfile: ########## @@ -0,0 +1,86 @@ +# hadolint ignore=DL3007 +FROM apache/airflow:latest +USER root +RUN apt-get update \ + && apt-get install -y --no-install-recommends unzip \ + # The below helps to keep the image size down + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* +RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" +RUN unzip awscliv2.zip && ./aws/install + +# Add a script to run the aws s3 sync command when the container is run +COPY <<"EOF" /entrypoint.sh +#!/bin/bash + +echo "Downloading DAGs from S3 bucket" +aws s3 sync "$S3_URL" "$CONTAINER_DAG_PATH" + +exec "$@" +EOF + +RUN chmod +x /entrypoint.sh + +USER airflow + +## Installing Python Dependencies +# Python dependencies can be installed by providing a requirements.txt. +# If the file is in a different location, use the requirements_path build argument to specify +# the file path. 
+ARG requirements_path=./requirements.txt +ENV REQUIREMENTS_PATH=$requirements_path + +# Uncomment the two lines below to copy the requirements.txt file to the container, and +# install the dependencies. +# COPY --chown=airflow:root $REQUIREMENTS_PATH /opt/airflow/requirements.txt +# RUN pip install --no-cache-dir -r /opt/airflow/requirements.txt + + +## AWS Authentication +# The image requires access to AWS services. This Dockerfile supports 2 ways to authenticate with AWS. +# The first is using build arguments where you can provide the AWS credentials as arguments +# passed when building the image. The other option is to copy the ~/.aws folder to the container, +# and authenticate using the credentials in that folder. +# If you would like to use an alternative method of authentication, feel free to make the +# necessary changes to this file. + +# Use these arguments to provide AWS authentication information +ARG aws_access_key_id +ARG aws_secret_access_key +ARG aws_default_region +ARG aws_session_token + +ENV AWS_ACCESS_KEY_ID=$aws_access_key_id Review Comment: Not very familiar with ECS, but shouldn't we inject these somehow instead of baking creds into the image directly? That's a red flag pattern in my eyes. 
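As a hedged sketch of the injection alternative the comment points toward (not part of this PR; the account ID and role names below are placeholders): ECS can supply temporary credentials at runtime through the task's IAM role, declared in the task definition via `taskRoleArn`, so no keys need to be baked into the image:

```json
{
  "family": "airflow-ecs-executor-task",
  "taskRoleArn": "arn:aws:iam::123456789012:role/AirflowTaskRole",
  "executionRoleArn": "arn:aws:iam::123456789012:role/AirflowTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "ecs_executor_container",
      "image": "my-airflow-image"
    }
  ]
}
```

Inside the container, the AWS CLI and boto3 resolve the role's credentials automatically from the task metadata endpoint, so the `ENV AWS_ACCESS_KEY_ID=...` block in the Dockerfile becomes unnecessary.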
########## airflow/providers/amazon/aws/executors/ecs/Dockerfile: ########## @@ -0,0 +1,86 @@ +# hadolint ignore=DL3007 +FROM apache/airflow:latest +USER root +RUN apt-get update \ + && apt-get install -y --no-install-recommends unzip \ + # The below helps to keep the image size down + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* +RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" +RUN unzip awscliv2.zip && ./aws/install + +# Add a script to run the aws s3 sync command when the container is run +COPY <<"EOF" /entrypoint.sh +#!/bin/bash + +echo "Downloading DAGs from S3 bucket" +aws s3 sync "$S3_URL" "$CONTAINER_DAG_PATH" + +exec "$@" +EOF + +RUN chmod +x /entrypoint.sh + +USER airflow + +## Installing Python Dependencies +# Python dependencies can be installed by providing a requirements.txt. +# If the file is in a different location, use the requirements_path build argument to specify +# the file path. +ARG requirements_path=./requirements.txt +ENV REQUIREMENTS_PATH=$requirements_path + +# Uncomment the two lines below to copy the requirements.txt file to the container, and +# install the dependencies. +# COPY --chown=airflow:root $REQUIREMENTS_PATH /opt/airflow/requirements.txt +# RUN pip install --no-cache-dir -r /opt/airflow/requirements.txt + + +## AWS Authentication +# The image requires access to AWS services. This Dockerfile supports 2 ways to authenticate with AWS. +# The first is using build arguments where you can provide the AWS credentials as arguments +# passed when building the image. The other option is to copy the ~/.aws folder to the container, +# and authenticate using the credentials in that folder. +# If you would like to use an alternative method of authentication, feel free to make the +# necessary changes to this file. 
+ +# Use these arguments to provide AWS authentication information +ARG aws_access_key_id +ARG aws_secret_access_key +ARG aws_default_region +ARG aws_session_token + +ENV AWS_ACCESS_KEY_ID=$aws_access_key_id +ENV AWS_SECRET_ACCESS_KEY=$aws_secret_access_key +ENV AWS_DEFAULT_REGION=$aws_default_region +ENV AWS_SESSION_TOKEN=$aws_session_token + +# Uncomment the line below to authenticate to AWS using the ~/.aws folder +# Keep in mind the docker build context when placing .aws folder +# COPY --chown=airflow:root ./.aws /home/airflow/.aws Review Comment: This pattern also seems problematic 🤷♂️ ########## airflow/providers/amazon/aws/executors/ecs/Dockerfile: ########## @@ -0,0 +1,86 @@ +# hadolint ignore=DL3007 +FROM apache/airflow:latest +USER root +RUN apt-get update \ + && apt-get install -y --no-install-recommends unzip \ + # The below helps to keep the image size down + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* +RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" +RUN unzip awscliv2.zip && ./aws/install + +# Add a script to run the aws s3 sync command when the container is run +COPY <<"EOF" /entrypoint.sh +#!/bin/bash + +echo "Downloading DAGs from S3 bucket" +aws s3 sync "$S3_URL" "$CONTAINER_DAG_PATH" + +exec "$@" +EOF + +RUN chmod +x /entrypoint.sh + +USER airflow + +## Installing Python Dependencies +# Python dependencies can be installed by providing a requirements.txt. +# If the file is in a different location, use the requirements_path build argument to specify +# the file path. +ARG requirements_path=./requirements.txt +ENV REQUIREMENTS_PATH=$requirements_path + +# Uncomment the two lines below to copy the requirements.txt file to the container, and +# install the dependencies. +# COPY --chown=airflow:root $REQUIREMENTS_PATH /opt/airflow/requirements.txt +# RUN pip install --no-cache-dir -r /opt/airflow/requirements.txt + + +## AWS Authentication +# The image requires access to AWS services. 
This Dockerfile supports 2 ways to authenticate with AWS. +# The first is using build arguments where you can provide the AWS credentials as arguments +# passed when building the image. The other option is to copy the ~/.aws folder to the container, +# and authenticate using the credentials in that folder. +# If you would like to use an alternative method of authentication, feel free to make the +# necessary changes to this file. + +# Use these arguments to provide AWS authentication information +ARG aws_access_key_id +ARG aws_secret_access_key +ARG aws_default_region +ARG aws_session_token + +ENV AWS_ACCESS_KEY_ID=$aws_access_key_id +ENV AWS_SECRET_ACCESS_KEY=$aws_secret_access_key +ENV AWS_DEFAULT_REGION=$aws_default_region +ENV AWS_SESSION_TOKEN=$aws_session_token + +# Uncomment the line below to authenticate to AWS using the ~/.aws folder +# Keep in mind the docker build context when placing .aws folder +# COPY --chown=airflow:root ./.aws /home/airflow/.aws Review Comment: This pattern also seems problematic 🤷♂️ ########## airflow/providers/amazon/aws/executors/ecs/Dockerfile: ########## @@ -0,0 +1,86 @@ +# hadolint ignore=DL3007 +FROM apache/airflow:latest +USER root +RUN apt-get update \ + && apt-get install -y --no-install-recommends unzip \ + # The below helps to keep the image size down + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* +RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" +RUN unzip awscliv2.zip && ./aws/install + +# Add a script to run the aws s3 sync command when the container is run +COPY <<"EOF" /entrypoint.sh +#!/bin/bash + +echo "Downloading DAGs from S3 bucket" +aws s3 sync "$S3_URL" "$CONTAINER_DAG_PATH" + +exec "$@" +EOF + +RUN chmod +x /entrypoint.sh + +USER airflow + +## Installing Python Dependencies +# Python dependencies can be installed by providing a requirements.txt. +# If the file is in a different location, use the requirements_path build argument to specify +# the file path. 
+ARG requirements_path=./requirements.txt +ENV REQUIREMENTS_PATH=$requirements_path Review Comment: Should these be commented out as well, like the section below? ########## airflow/providers/amazon/aws/executors/ecs/Dockerfile: ########## @@ -0,0 +1,86 @@ +# hadolint ignore=DL3007 +FROM apache/airflow:latest +USER root +RUN apt-get update \ + && apt-get install -y --no-install-recommends unzip \ + # The below helps to keep the image size down + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* +RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" +RUN unzip awscliv2.zip && ./aws/install + +# Add a script to run the aws s3 sync command when the container is run +COPY <<"EOF" /entrypoint.sh +#!/bin/bash + +echo "Downloading DAGs from S3 bucket" +aws s3 sync "$S3_URL" "$CONTAINER_DAG_PATH" + +exec "$@" +EOF + +RUN chmod +x /entrypoint.sh + +USER airflow + +## Installing Python Dependencies +# Python dependencies can be installed by providing a requirements.txt. +# If the file is in a different location, use the requirements_path build argument to specify +# the file path. +ARG requirements_path=./requirements.txt +ENV REQUIREMENTS_PATH=$requirements_path + +# Uncomment the two lines below to copy the requirements.txt file to the container, and +# install the dependencies. +# COPY --chown=airflow:root $REQUIREMENTS_PATH /opt/airflow/requirements.txt +# RUN pip install --no-cache-dir -r /opt/airflow/requirements.txt + + +## AWS Authentication +# The image requires access to AWS services. This Dockerfile supports 2 ways to authenticate with AWS. +# The first is using build arguments where you can provide the AWS credentials as arguments +# passed when building the image. The other option is to copy the ~/.aws folder to the container, +# and authenticate using the credentials in that folder. +# If you would like to use an alternative method of authentication, feel free to make the +# necessary changes to this file. 
+ +# Use these arguments to provide AWS authentication information +ARG aws_access_key_id +ARG aws_secret_access_key +ARG aws_default_region +ARG aws_session_token + +ENV AWS_ACCESS_KEY_ID=$aws_access_key_id +ENV AWS_SECRET_ACCESS_KEY=$aws_secret_access_key +ENV AWS_DEFAULT_REGION=$aws_default_region +ENV AWS_SESSION_TOKEN=$aws_session_token + +# Uncomment the line below to authenticate to AWS using the ~/.aws folder +# Keep in mind the docker build context when placing .aws folder +# COPY --chown=airflow:root ./.aws /home/airflow/.aws + + +## Loading DAGs +# This Dockerfile supports 2 ways to load DAGs onto the container. +# One is to upload all the DAGs onto an S3 bucket, and then +# download them onto the container. The other is to copy a local folder with +# the DAGs onto the container. +# If you would like to use an alternative method of loading DAGs, feel free to make the +# necessary changes to this file. + +ARG host_dag_path=./dags +ENV HOST_DAG_PATH=$host_dag_path +# Set host_dag_path to the path of the DAGs on the host +# COPY --chown=airflow:root $HOST_DAG_PATH $CONTAINER_DAG_PATH + + +# If using S3 bucket as source of DAGs, uncommenting the next ENTRYPOINT command will overwrite this one. +ENTRYPOINT [] + +# Use these arguments to load DAGs onto the container from S3 +ARG s3_url +ENV S3_URL=$s3_url +ARG container_dag_path=/opt/airflow/dags +ENV CONTAINER_DAG_PATH=$container_dag_path +# Uncomment the line if using S3 bucket as the source of DAGs +# ENTRYPOINT ["/entrypoint.sh"] Review Comment: This would mean the OSS entrypoint is skipped. Should we wrap it instead? ########## airflow/providers/amazon/aws/executors/ecs/README.md: ########## @@ -0,0 +1,196 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. 
The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + --> + +# AWS ECS Executor + +This is an Airflow executor powered by Amazon Elastic Container Service (ECS). Each task that Airflow schedules for execution is run within its own ECS container. Some benefits of an executor like this include: + +1. Task isolation: No task can be a noisy neighbor for another. Resources like CPU, memory and disk are isolated to each individual task. Any actions or failures which affect networking or fail the entire container only affect the single task running in it. No single user can overload the environment by triggering too many tasks, because there are no shared workers. +2. Customized environments: You can build different container images which incorporate specific dependencies (such as system level dependencies), binaries, or data required for a task to run. +3. Cost effective: Compute resources only exist for the lifetime of the Airflow task itself. This saves costs by not requiring persistent/long lived workers ready at all times, which also need maintenance and patching. + +For a quick start guide please see [here](Setup_guide.md); it will get you up and running with a basic configuration. + +The sections below provide more general details about configuration, the provided example Dockerfile, and logging.
+ +## Config Options + +There are a number of configuration options available, which can either be set directly in the airflow.cfg +file under an "aws_ecs_executor" section or via environment variables using the `AIRFLOW__AWS_ECS_EXECUTOR__<OPTION_NAME>` +format, for example `AIRFLOW__AWS_ECS_EXECUTOR__CONTAINER_NAME = "myEcsContainer"`. For more information +on how to set these options, see [Setting Configuration Options](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html) + +In the case of conflicts, the order of precedence is: + +1. Load default values for options which have defaults. +2. Load any values provided in the RUN_TASK_KWARGS option if one is provided. +3. Load any values explicitly provided through airflow.cfg or environment variables. These are checked with Airflow's config precedence. + +### Required config options: + +- CLUSTER - Name of the Amazon ECS Cluster. Required. +- CONTAINER_NAME - Name of the container that will be used to execute Airflow tasks via the ECS executor. +The container should be specified in the ECS Task Definition. Required. +- REGION - The name of the AWS Region where Amazon ECS is configured. Required. + +### Optional config options: + +- ASSIGN_PUBLIC_IP - Whether to assign a public IP address to the containers launched by the ECS executor. Defaults to "False". +- CONN_ID - The Airflow connection (i.e. credentials) used by the ECS executor to make API calls to AWS ECS. Defaults to "aws_default". +- LAUNCH_TYPE - Launch type can either be 'FARGATE' OR 'EC2'. Defaults to "FARGATE". +- PLATFORM_VERSION - The platform version the ECS task uses if the FARGATE launch type is used. Defaults to "LATEST". +- RUN_TASK_KWARGS - A JSON string containing arguments to provide the ECS `run_task` API. +- SECURITY_GROUPS - Up to 5 comma-separated security group IDs associated with the ECS task. Defaults to the VPC default. +- SUBNETS - Up to 16 comma-separated subnet IDs associated with the ECS task or service.
Defaults to the VPC default. +- TASK_DEFINITION - The family and revision (family:revision) or full ARN of the ECS task definition to run. Defaults to the latest ACTIVE revision. +- MAX_RUN_TASK_ATTEMPTS - The maximum number of times the ECS Executor should attempt to run a task. + +For a more detailed description of available options, including type hints and examples, see the `config_templates` folder in the Amazon provider package. + +## Dockerfile for ECS Executor + +An example Dockerfile can be found [here](Dockerfile); it creates an image that can be used on an ECS container to run Airflow tasks using the AWS ECS Executor in Apache Airflow. The image +supports AWS CLI/API integration, allowing you to interact with AWS services within your Airflow environment. It also includes options to load DAGs (Directed Acyclic Graphs) from either an S3 bucket or a local folder. + +### Base Image + +The Docker image is built upon the `apache/airflow:latest` image. See [here](https://hub.docker.com/r/apache/airflow) for more information about the image. + +Important note: The python version in this image must match the python version on the host/container which is running the Airflow scheduler process (which in turn runs the executor). The python version of the image can be verified by running the container, and printing the python version as follows: + +``` +docker run <image_name> python --version +``` + +Ensure that this version matches the python version of the host/container which is running the Airflow scheduler process (and thus, the ECS executor). Apache Airflow images with specific python versions can be downloaded from the Dockerhub registry by filtering tags by the [python version](https://hub.docker.com/r/apache/airflow/tags?page=1&name=3.8). For example, the tag `latest-python3.8` specifies that the image will have python 3.8 installed. + +### Prerequisites + +Docker must be installed on your system.
Instructions for installing Docker can be found [here](https://docs.docker.com/get-docker/). + +### AWS Credentials + +The [AWS CLI](https://aws.amazon.com/cli/) is installed within the container, and there are multiple ways to pass AWS authentication information to the container. This guide will cover 2 methods. + +The first method is to use the build-time arguments (`aws_access_key_id`, `aws_secret_access_key`, `aws_default_region`, and `aws_session_token`). +To pass AWS authentication information using these arguments, use the `--build-arg` option during the Docker build process. For example: + +``` +docker build -t my-airflow-image \ + --build-arg aws_access_key_id=YOUR_ACCESS_KEY \ + --build-arg aws_secret_access_key=YOUR_SECRET_KEY \ + --build-arg aws_default_region=YOUR_DEFAULT_REGION \ + --build-arg aws_session_token=YOUR_SESSION_TOKEN . +``` + +Replace `YOUR_ACCESS_KEY`, `YOUR_SECRET_KEY`, `YOUR_SESSION_TOKEN`, and `YOUR_DEFAULT_REGION` with valid AWS credentials. + +Alternatively, you can authenticate to AWS using the `~/.aws` folder. See instructions on how to generate this folder [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). Uncomment the line in the Dockerfile to copy the `./.aws` folder from your host machine to the container's `/home/airflow/.aws` directory. Keep in mind the Docker build context when copying the `.aws` folder to the container. + +### Loading DAGs + +There are many ways to load DAGs on the ECS container. This Dockerfile is preconfigured with two possible ways: copying from a local folder, or downloading from an S3 bucket. Other methods of loading DAGs are possible as well. + +#### From S3 Bucket + +To load DAGs from an S3 bucket, uncomment the entrypoint line in the Dockerfile to synchronize the DAGs from the specified S3 bucket to the `/opt/airflow/dags` directory inside the container. 
You can optionally provide `container_dag_path` as a build argument if you want to store the DAGs in a directory other than `/opt/airflow/dags`. + +Add `--build-arg s3_url=YOUR_S3_URL` in the docker build command. +Replace `YOUR_S3_URL` with the URL of your S3 bucket. Make sure you have the appropriate permissions to read from the bucket. + +Note that the following command is also passing in AWS credentials as build arguments. + +``` +docker build -t my-airflow-image \ + --build-arg aws_access_key_id=YOUR_ACCESS_KEY \ + --build-arg aws_secret_access_key=YOUR_SECRET_KEY \ + --build-arg aws_default_region=YOUR_DEFAULT_REGION \ + --build-arg aws_session_token=YOUR_SESSION_TOKEN \ + --build-arg s3_url=YOUR_S3_URL . +``` + + +#### From Local Folder + +To load DAGs from a local folder, place your DAG files in a folder within the docker build context on your host machine, and provide the location of the folder using the `host_dag_path` build argument. By default, the DAGs will be copied to `/opt/airflow/dags`, but this can be changed by passing the `container_dag_path` build-time argument during the Docker build process: + +``` +docker build -t my-airflow-image --build-arg host_dag_path=./dags_on_host --build-arg container_dag_path=/path/on/container . +``` + +If choosing to load DAGs onto a different path than `/opt/airflow/dags`, then the new path will need to be updated in the Airflow config. + +#### Mounting a Volume Review Comment: Not sure why we have this section in the readme? What does volume mounting a dir in a local container have to do with the ECS executor? ########## airflow/providers/amazon/aws/executors/ecs/README.md: ########## @@ -0,0 +1,196 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. 
The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + --> + +# AWS ECS Executor + +This is an Airflow executor powered by Amazon Elastic Container Service (ECS). Each task that Airflow schedules for execution is run within its own ECS container. Some benefits of an executor like this include: + +1. Task isolation: No task can be a noisy neighbor for another. Resources like CPU, memory and disk are isolated to each individual task. Any actions or failures which affect networking or fail the entire container only affect the single task running in it. No single user can overload the environment by triggering too many tasks, because there are no shared workers. +2. Customized environments: You can build different container images which incorporate specific dependencies (such as system level dependencies), binaries, or data required for a task to run. +3. Cost effective: Compute resources only exist for the lifetime of the Airflow task itself. This saves costs by not requiring persistent/long lived workers ready at all times, which also need maintenance and patching. + +For a quick start guide please see [here](Setup_guide.md), it will get you up and running with a basic configuration. + +The below sections provide more generic details about configuration, the provided example Dockerfile and logging. 
+ +## Config Options + +There are a number of configuration options available, which can either be set directly in the airflow.cfg +file under an "aws_ecs_executor" section or via environment variables using the `AIRFLOW__AWS_ECS_EXECUTOR__<OPTION_NAME>` +format, for example `AIRFLOW__AWS_ECS_EXECUTOR__CONTAINER_NAME = "myEcsContainer"`. For more information +on how to set these options, see [Setting Configuration Options](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html) + +In the case of conflicts, the order of precedence is: + +1. Load default values for options which have defaults. +2. Load any values provided in the RUN_TASK_KWARGS option if one is provided. +3. Load any values explicitly provided through airflow.cfg or environment variables. These are checked with Airflow's config precedence. + +### Required config options: + +- CLUSTER - Name of the Amazon ECS Cluster. Required. +- CONTAINER_NAME - Name of the container that will be used to execute Airflow tasks via the ECS executor. +The container should be specified in the ECS Task Definition. Required. +- REGION - The name of the AWS Region where Amazon ECS is configured. Required. + +### Optional config options: + +- ASSIGN_PUBLIC_IP - "Whether to assign a public IP address to the containers launched by the ECS executor. Defaults to "False". +- CONN_ID - The Airflow connection (i.e. credentials) used by the ECS executor to make API calls to AWS ECS. Defaults to "aws_default". +- LAUNCH_TYPE - Launch type can either be 'FARGATE' OR 'EC2'. Defaults to "FARGATE". +- PLATFORM_VERSION - The platform version the ECS task uses if the FARGATE launch type is used. Defaults to "LATEST". +- RUN_TASK_KWARGS - A JSON string containing arguments to provide the ECS `run_task` API. +- SECURITY_GROUPS - Up to 5 comma-seperated security group IDs associated with the ECS task. Defaults to the VPC default. +- SUBNETS - Up to 16 comma-separated subnet IDs associated with the ECS task or service. 
Defaults to the VPC default. +- TASK_DEFINITION - The family and revision (family:revision) or full ARN of the ECS task definition to run. Defaults to the latest ACTIVE revision. +- MAX_RUN_TASK_ATTEMPTS - The maximum number of times the ECS Executor should attempt to run a task. + +For a more detailed description of available options, including type hints and examples, see the `config_templates` folder in the Amazon provider package. + +## Dockerfile for ECS Executor + +An example Dockerfile can be found [here](Dockerfile); it creates an image that can be used on an ECS container to run Airflow tasks using the AWS ECS Executor in Apache Airflow. The image +supports AWS CLI/API integration, allowing you to interact with AWS services within your Airflow environment. It also includes options to load DAGs (Directed Acyclic Graphs) from either an S3 bucket or a local folder. + +### Base Image + +The Docker image is built upon the `apache/airflow:latest` image. See [here](https://hub.docker.com/r/apache/airflow) for more information about the image. + +Important note: The python version in this image must match the python version on the host/container which is running the Airflow scheduler process (which in turn runs the executor). The python version of the image can be verified by running the container and printing the python version as follows: + +``` +docker run <image_name> python --version +``` + +Ensure that this version matches the python version of the host/container which is running the Airflow scheduler process (and thus, the ECS executor). Apache Airflow images with specific python versions can be downloaded from the Docker Hub registry by filtering tags by [python version](https://hub.docker.com/r/apache/airflow/tags?page=1&name=3.8). For example, the tag `latest-python3.8` specifies that the image will have python 3.8 installed. + +### Prerequisites + +Docker must be installed on your system.
Instructions for installing Docker can be found [here](https://docs.docker.com/get-docker/). + +### AWS Credentials + +The [AWS CLI](https://aws.amazon.com/cli/) is installed within the container, and there are multiple ways to pass AWS authentication information to the container. This guide will cover 2 methods. + +The first method is to use the build-time arguments (`aws_access_key_id`, `aws_secret_access_key`, `aws_default_region`, and `aws_session_token`). +To pass AWS authentication information using these arguments, use the `--build-arg` option during the Docker build process. For example: + +``` +docker build -t my-airflow-image \ + --build-arg aws_access_key_id=YOUR_ACCESS_KEY \ + --build-arg aws_secret_access_key=YOUR_SECRET_KEY \ + --build-arg aws_default_region=YOUR_DEFAULT_REGION \ + --build-arg aws_session_token=YOUR_SESSION_TOKEN . +``` + +Replace `YOUR_ACCESS_KEY`, `YOUR_SECRET_KEY`, `YOUR_SESSION_TOKEN`, and `YOUR_DEFAULT_REGION` with valid AWS credentials. + +Alternatively, you can authenticate to AWS using the `~/.aws` folder. See instructions on how to generate this folder [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). Uncomment the line in the Dockerfile to copy the `./.aws` folder from your host machine to the container's `/home/airflow/.aws` directory. Keep in mind the Docker build context when copying the `.aws` folder to the container. + +### Loading DAGs + +There are many ways to load DAGs on the ECS container. This Dockerfile is preconfigured with two possible ways: copying from a local folder, or downloading from an S3 bucket. Other methods of loading DAGs are possible as well. + +#### From S3 Bucket + +To load DAGs from an S3 bucket, uncomment the entrypoint line in the Dockerfile to synchronize the DAGs from the specified S3 bucket to the `/opt/airflow/dags` directory inside the container. 
You can optionally provide `container_dag_path` as a build argument if you want to store the DAGs in a directory other than `/opt/airflow/dags`. + +Add `--build-arg s3_url=YOUR_S3_URL` in the docker build command. +Replace `YOUR_S3_URL` with the URL of your S3 bucket. Make sure you have the appropriate permissions to read from the bucket. + +Note that the following command is also passing in AWS credentials as build arguments. + +``` +docker build -t my-airflow-image \ + --build-arg aws_access_key_id=YOUR_ACCESS_KEY \ + --build-arg aws_secret_access_key=YOUR_SECRET_KEY \ + --build-arg aws_default_region=YOUR_DEFAULT_REGION \ + --build-arg aws_session_token=YOUR_SESSION_TOKEN \ + --build-arg s3_url=YOUR_S3_URL . +``` + + +#### From Local Folder + +To load DAGs from a local folder, place your DAG files in a folder within the docker build context on your host machine, and provide the location of the folder using the `host_dag_path` build argument. By default, the DAGs will be copied to `/opt/airflow/dags`, but this can be changed by passing the `container_dag_path` build-time argument during the Docker build process: + +``` +docker build -t my-airflow-image --build-arg host_dag_path=./dags_on_host --build-arg container_dag_path=/path/on/container . +``` + +If choosing to load DAGs onto a different path than `/opt/airflow/dags`, then the new path will need to be updated in the Airflow config. + +#### Mounting a Volume + +You can optionally mount a local directory as a volume on the container at runtime. This will allow you to make changes to files on the mounted directory, and have those changes be reflected in the container. To do this, run the following command: + +``` +docker run --volume /abs/path/to/local/dir:/abs/path/to/remote/dir <image_name> +``` + +Note: Doing this will overwrite the contents of the directory on the container with the contents of the local directory.
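A common use of the volume approach is live-editing DAGs during development. As an illustrative sketch (the host path and the `my-airflow-image` tag are placeholders from the earlier build examples), the local DAGs folder can be mounted over the container's default DAG directory:

```
docker run --volume /path/to/local/dags:/opt/airflow/dags my-airflow-image
```

Changes saved to the local folder then appear inside the running container without rebuilding the image.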
+ +### Installing Python Dependencies + +This Dockerfile supports installing Python dependencies via `pip` from a `requirements.txt` file. Place your `requirements.txt` file in the same directory as the Dockerfile. If it is in a different location, it can be specified using the `requirements_path` build argument. Keep in mind the Docker context when copying the `requirements.txt` file. Uncomment the two appropriate lines in the Dockerfile that copy the `requirements.txt` file to the container, and run `pip install` to install the dependencies on the container. + +### Building Image for ECS Executor + +Detailed instructions on how to use the Docker image that you have created via this readme with the ECS Executor can be found [here](link_to_how_to_guide). + + +## Logging + +Airflow tasks executed via this executor run in ECS containers within the configured VPC. This means that logs are not directly accessible to the Airflow Webserver, and when containers are stopped after task completion, the logs are permanently lost. + +Remote logging can be employed when using the ECS executor to persist your Airflow Task logs and make them viewable from the Airflow Webserver. + +### Configuring Remote Logging + +There are many ways to configure remote logging and several supported destinations. A general overview of Airflow Task logging can be found [here](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/logging-tasks.html). Instructions for configuring S3 remote logging can be found [here](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/logging/s3-task-handler.html) and CloudWatch remote logging [here](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/logging/cloud-watch-task-handlers.html).
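As a sketch of what S3 remote logging could look like when configured through environment variables (the bucket name and prefix are placeholders; the option names are standard Airflow `[logging]` options):

```shell
# Enable remote logging and point it at an S3 prefix (placeholder bucket name)
export AIRFLOW__LOGGING__REMOTE_LOGGING=True
export AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER=s3://my-airflow-logs/ecs-executor
# Airflow connection holding the AWS credentials used by the log handler
export AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID=aws_default
```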
+Some important things to point out for remote logging in the context of the ECS executor: + + - The configuration options for Airflow remote logging must be configured on the host running the Airflow Webserver (so that it can fetch logs from the remote location) as well as within the ECS container running the Airflow Tasks (so that it can upload the logs to the remote location). See [here](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html) to read more about how to set Airflow configuration via config file or environment variable exports. Review Comment: Config should be consistent everywhere, not sure why we need to call this out explicitly. See this in the [config ref](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html): > Use the same configuration across all the Airflow components. ########## airflow/providers/amazon/aws/executors/ecs/Setup_guide.md: ########## @@ -0,0 +1,148 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + --> + +# Setting up an ECS Executor for Apache Airflow + +There are 3 steps involved in getting an ECS Executor to work in Apache Airflow: + +1. Creating a database that Airflow and the Executor can connect to. +2.
Creating and configuring an ECS Cluster that can run tasks from Airflow. +3. Configuring Airflow to use the ECS Executor and the database. + +There are different options for selecting a database backend. See [here](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-up-database.html) for more information about the different options supported by Airflow. The following guide will explain how to set up a PostgreSQL RDS Instance on AWS. The guide will also cover setting up an ECS cluster. The ECS Executor supports various launch types, but this guide will explain how to set up an ECS Fargate cluster. + +## Setting up an RDS DB Instance for ECS Executors + +### Create the RDS DB Instance + +1. Log in to your AWS Management Console and navigate to the RDS service. +2. Click "Create database" to start creating a new RDS instance. +3. Choose the "Standard create" option, and select PostgreSQL. +4. Select the appropriate template, availability and durability. + - NOTE: At the time of this writing, the "Multi-AZ DB **Cluster**" option does not support setting the database name, which is a required step below. +5. Set the DB Instance name, the username and password. +6. Choose the instance configuration and storage parameters. +7. In the Connectivity section, select Don't connect to an EC2 compute resource. +8. Select or create a VPC and subnet, and allow public access to the DB. Select or create a security group and select the Availability Zone. +9. Open the Additional Configuration tab and set the database name to `airflow_db`. +10. Select other settings as required, and create the database by clicking Create database. + + +### Test Connectivity + +In order to be able to connect to the new RDS instance, you need to allow inbound traffic to the database from your IP address. + + +1. Under the "Security" heading in the "Connectivity & security" tab of the RDS instance, find the link to the VPC security group for your new RDS DB instance. +2.
Create an inbound rule that allows traffic from your IP address(es) on TCP port 5432 (PostgreSQL). + +3. Confirm that you can connect to the DB after modifying the security group. This will require having `psql` installed. Instructions for installing `psql` can be found [here](https://www.postgresql.org/download/). + +**NOTE**: Be sure that the status of your DB is Available before testing connectivity. + +``` +psql -h <endpoint> -p 5432 -U <username> <db_name> +``` + +The endpoint can be found on the "Connectivity and Security" tab, and the username (and password) are the credentials used when creating the database. +The db_name should be `airflow_db` (unless a different one was used when creating the database). + +You will be prompted to enter the password if the connection is successful. + + +## Creating an ECS Cluster with Fargate, and Task Definitions + +In order to create a Task Definition for the ECS Cluster that will work with Apache Airflow, you will need a Docker image that is properly configured. See the [Dockerfile](README.md#dockerfile-for-ecs-executor) section for instructions on how to do that. + +Once the image is built, it needs to be put in a repository where it can be pulled by ECS. There are multiple ways to accomplish this. This guide will go over doing this using Amazon Elastic Container Registry (ECR). + +### Create an ECR Repository + +1. Log in to your AWS Management Console and navigate to the ECR service. +2. Click Create repository. +3. Name the repository and fill out other information as required. +4. Click Create Repository. +5. Once the repository has been created, click on the repository. Click on the "View push commands" button on the top right. +6. Follow the instructions to push the Docker image, replacing image names as appropriate. Ensure the image is uploaded by refreshing the page once the image is pushed. + +### Create ECS Cluster + +1. Log in to your AWS Management Console and navigate to the Amazon Elastic Container Service.
+2. Click "Clusters" then click "Create Cluster". +3. Make sure that AWS Fargate (Serverless) is selected under Infrastructure. +4. Select other options as required and click Create to create the cluster. + +### Create Task Definition + +1. Click on Task Definitions on the left hand bar, and click Create new task definition. +2. Choose the Task Definition Family name. Select AWS Fargate for the Launch Type. +3. Select or create the Task Role and Task Execution Role, and ensure the roles have the required permissions to accomplish their respective tasks. You can choose to create a new Task Execution role that will have the basic minimum permissions in order for the task to run. +4. Select a name for the container, and use the image URI of the image that was pushed in the previous section. Make sure the role being used has the required permissions to pull the image. +5. Add the following environment variables to the container: + + - `AIRFLOW__DATABASE__SQL_ALCHEMY_CONN`, with the value being the PostgreSQL connection string in the following format using the values set during the [Database section](#create-the-rds-db-instance) above: + +``` +postgresql+psycopg2://<username>:<password>@<endpoint>/<database_name> +``` + + - `AIRFLOW__ECS_EXECUTOR__SECURITY_GROUPS`, with the value being a comma separated list of security group IDs associated with the VPC used for the RDS instance. + - `AIRFLOW__ECS_EXECUTOR__SUBNETS`, with the value being a comma separated list of subnet IDs of the subnets associated with the RDS instance. + +6. Add other configuration as necessary for Airflow generally ([see here](https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html)), the ECS executor ([see here](README.md#config-options)) or for remote logging ([see here](README.md#logging)). Review Comment: Yeah, this is the section that needs to be revamped to make sure config is consistent everywhere! 
########## airflow/providers/amazon/aws/config_templates/config.yml: ########## @@ -0,0 +1,131 @@ +# Licensed to the Apache Software Foundation (ASF) under one Review Comment: Isn't this meant to be [here](https://github.com/apache/airflow/blob/8ecd576de1043dbea40e5e16b5dc34859cc41725/airflow/providers/amazon/provider.yaml#L717) instead? ########## airflow/providers/amazon/aws/executors/ecs/Dockerfile: ########## @@ -0,0 +1,86 @@ +# hadolint ignore=DL3007 +FROM apache/airflow:latest Review Comment: We might consider a different home for this? It's not obvious this is an example to me. Maybe docs, or at least a different path that'll better reflect it? e.g. https://github.com/apache/airflow/blob/main/docs/apache-airflow/howto/docker-compose/docker-compose.yaml ########## airflow/providers/amazon/aws/executors/ecs/README.md: ########## @@ -0,0 +1,196 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + --> + +# AWS ECS Executor + +This is an Airflow executor powered by Amazon Elastic Container Service (ECS). Each task that Airflow schedules for execution is run within its own ECS container. Some benefits of an executor like this include: + +1. Task isolation: No task can be a noisy neighbor for another. 
Resources like CPU, memory, and disk are isolated to each individual task. Any actions or failures which affect networking or fail the entire container only affect the single task running in it. No single user can overload the environment by triggering too many tasks, because there are no shared workers. +2. Customized environments: You can build different container images which incorporate specific dependencies (such as system-level dependencies), binaries, or data required for a task to run. +3. Cost effective: Compute resources only exist for the lifetime of the Airflow task itself. This saves costs by not requiring persistent/long-lived workers ready at all times, which also need maintenance and patching. + +For a quick start guide, see [here](Setup_guide.md); it will get you up and running with a basic configuration. + +The sections below provide more general details about configuration, the provided example Dockerfile, and logging. + +## Config Options + +There are a number of configuration options available, which can either be set directly in the airflow.cfg +file under an "aws_ecs_executor" section or via environment variables using the `AIRFLOW__AWS_ECS_EXECUTOR__<OPTION_NAME>` +format, for example `AIRFLOW__AWS_ECS_EXECUTOR__CONTAINER_NAME = "myEcsContainer"`. For more information +on how to set these options, see [Setting Configuration Options](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html). + +In the case of conflicts, the order of precedence is: + +1. Load default values for options which have defaults. +2. Load any values provided in the RUN_TASK_KWARGS option if one is provided. +3. Load any values explicitly provided through airflow.cfg or environment variables. These are checked with Airflow's config precedence. + +### Required config options: + +- CLUSTER - Name of the Amazon ECS Cluster. Required. +- CONTAINER_NAME - Name of the container that will be used to execute Airflow tasks via the ECS executor.
+The container should be specified in the ECS Task Definition. Required. +- REGION - The name of the AWS Region where Amazon ECS is configured. Required. + +### Optional config options: + +- ASSIGN_PUBLIC_IP - Whether to assign a public IP address to the containers launched by the ECS executor. Defaults to "False". +- CONN_ID - The Airflow connection (i.e. credentials) used by the ECS executor to make API calls to AWS ECS. Defaults to "aws_default". +- LAUNCH_TYPE - Launch type can either be 'FARGATE' or 'EC2'. Defaults to "FARGATE". +- PLATFORM_VERSION - The platform version the ECS task uses if the FARGATE launch type is used. Defaults to "LATEST". +- RUN_TASK_KWARGS - A JSON string containing arguments to provide to the ECS `run_task` API. +- SECURITY_GROUPS - Up to 5 comma-separated security group IDs associated with the ECS task. Defaults to the VPC default. +- SUBNETS - Up to 16 comma-separated subnet IDs associated with the ECS task or service. Defaults to the VPC default. +- TASK_DEFINITION - The family and revision (family:revision) or full ARN of the ECS task definition to run. Defaults to the latest ACTIVE revision. +- MAX_RUN_TASK_ATTEMPTS - The maximum number of times the ECS Executor should attempt to run a task. + +For a more detailed description of available options, including type hints and examples, see the `config_templates` folder in the Amazon provider package. + +## Dockerfile for ECS Executor + +An example Dockerfile can be found [here](Dockerfile); it creates an image that can be used on an ECS container to run Airflow tasks using the AWS ECS Executor in Apache Airflow. The image +supports AWS CLI/API integration, allowing you to interact with AWS services within your Airflow environment. It also includes options to load DAGs (Directed Acyclic Graphs) from either an S3 bucket or a local folder. + +### Base Image + +The Docker image is built upon the `apache/airflow:latest` image.
See [here](https://hub.docker.com/r/apache/airflow) for more information about the image. + +Important note: The python version in this image must match the python version on the host/container which is running the Airflow scheduler process (which in turn runs the executor). The python version of the image can be verified by running the container and printing the python version as follows: + +``` +docker run <image_name> python --version +``` + +Ensure that this version matches the python version of the host/container which is running the Airflow scheduler process (and thus, the ECS executor). Apache Airflow images with specific python versions can be downloaded from the Docker Hub registry by filtering tags by [python version](https://hub.docker.com/r/apache/airflow/tags?page=1&name=3.8). For example, the tag `latest-python3.8` specifies that the image will have python 3.8 installed. + +### Prerequisites + +Docker must be installed on your system. Instructions for installing Docker can be found [here](https://docs.docker.com/get-docker/). + +### AWS Credentials + +The [AWS CLI](https://aws.amazon.com/cli/) is installed within the container, and there are multiple ways to pass AWS authentication information to the container. This guide will cover 2 methods. + +The first method is to use the build-time arguments (`aws_access_key_id`, `aws_secret_access_key`, `aws_default_region`, and `aws_session_token`). +To pass AWS authentication information using these arguments, use the `--build-arg` option during the Docker build process. For example: + +``` +docker build -t my-airflow-image \ + --build-arg aws_access_key_id=YOUR_ACCESS_KEY \ + --build-arg aws_secret_access_key=YOUR_SECRET_KEY \ + --build-arg aws_default_region=YOUR_DEFAULT_REGION \ + --build-arg aws_session_token=YOUR_SESSION_TOKEN . +``` + +Replace `YOUR_ACCESS_KEY`, `YOUR_SECRET_KEY`, `YOUR_SESSION_TOKEN`, and `YOUR_DEFAULT_REGION` with valid AWS credentials.
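To sanity-check that the baked-in credentials actually work (assuming the image was built as above; `my-airflow-image` is the example tag), the AWS CLI inside the container can be asked which identity it is authenticated as:

```
docker run my-airflow-image aws sts get-caller-identity
```

A successful call prints the account, user ID, and ARN associated with the credentials.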
+ +Alternatively, you can authenticate to AWS using the `~/.aws` folder. See instructions on how to generate this folder [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). Uncomment the line in the Dockerfile to copy the `./.aws` folder from your host machine to the container's `/home/airflow/.aws` directory. Keep in mind the Docker build context when copying the `.aws` folder to the container. + +### Loading DAGs + +There are many ways to load DAGs on the ECS container. This Dockerfile is preconfigured with two possible ways: copying from a local folder, or downloading from an S3 bucket. Other methods of loading DAGs are possible as well. + +#### From S3 Bucket + +To load DAGs from an S3 bucket, uncomment the entrypoint line in the Dockerfile to synchronize the DAGs from the specified S3 bucket to the `/opt/airflow/dags` directory inside the container. You can optionally provide `container_dag_path` as a build argument if you want to store the DAGs in a directory other than `/opt/airflow/dags`. + +Add `--build-arg s3_url=YOUR_S3_URL` in the docker build command. +Replace `YOUR_S3_URL` with the URL of your S3 bucket. Make sure you have the appropriate permissions to read from the bucket. + +Note that the following command is also passing in AWS credentials as build arguments. + +``` +docker build -t my-airflow-image \ + --build-arg aws_access_key_id=YOUR_ACCESS_KEY \ + --build-arg aws_secret_access_key=YOUR_SECRET_KEY \ + --build-arg aws_default_region=YOUR_DEFAULT_REGION \ + --build-arg aws_session_token=YOUR_SESSION_TOKEN \ + --build-arg s3_url=YOUR_S3_URL . +``` + + +#### From Local Folder + +To load DAGs from a local folder, place your DAG files in a folder within the docker build context on your host machine, and provide the location of the folder using the `host_dag_path` build argument. 
By default, the DAGs will be copied to `/opt/airflow/dags`, but this can be changed by passing the `container_dag_path` build-time argument during the Docker build process: + +``` +docker build -t my-airflow-image --build-arg host_dag_path=./dags_on_host --build-arg container_dag_path=/path/on/container . +``` + +If choosing to load DAGs onto a different path than `/opt/airflow/dags`, then the new path will need to be updated in the Airflow config. + +#### Mounting a Volume + +You can optionally mount a local directory as a volume on the container at runtime. This will allow you to make changes to files on the mounted directory, and have those changes be reflected in the container. To do this, run the following command: + +``` +docker run --volume /abs/path/to/local/dir:/abs/path/to/remote/dir <image_name> +``` + +Note: Doing this will overwrite the contents of the directory on the container with the contents of the local directory. + +### Installing Python Dependencies + +This Dockerfile supports installing Python dependencies via `pip` from a `requirements.txt` file. Place your `requirements.txt` file in the same directory as the Dockerfile. If it is in a different location, it can be specified using the `requirements_path` build argument. Keep in mind the Docker context when copying the `requirements.txt` file. Uncomment the two appropriate lines in the Dockerfile that copy the `requirements.txt` file to the container, and run `pip install` to install the dependencies on the container. + +### Building Image for ECS Executor + +Detailed instructions on how to use the Docker image that you have created via this readme with the ECS Executor can be found [here](link_to_how_to_guide). + + +## Logging + +Airflow tasks executed via this executor run in ECS containers within the configured VPC. This means that logs are not directly accessible to the Airflow Webserver, and when containers are stopped after task completion, the logs are permanently lost.
+ +Remote logging can be employed when using the ECS executor to persist your Airflow Task logs and make them viewable from the Airflow Webserver. + +### Configuring Remote Logging + +There are many ways to configure remote logging and several supported destinations. A general overview of Airflow Task logging can be found [here](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/logging-tasks.html). Instructions for configuring S3 remote logging can be found [here](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/logging/s3-task-handler.html) and Cloudwatch remote logging [here](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/logging/cloud-watch-task-handlers.html). +Some important things to point out for remote logging in the context of the ECS executor: + + - The configuration options for Airflow remote logging must be configured on the host running the Airflow Webserver (so that it can fetch logs from the remote location) as well as within the ECS container running the Airflow Tasks (so that it can upload the logs to the remote location). See [here](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html) to read more about how to set Airflow configuration via config file or environment variable exports. + - Adding the Airflow remote logging config to the container can be done in many ways. Some examples include, but are not limited to: Review Comment: Related to above, we should set the expectation that however Airflow overall is configured, is also used for these ECS containers. Suggesting bespoke config for the tasks is asking for problems eventually imo. ########## airflow/providers/amazon/aws/executors/ecs/README.md: ########## @@ -0,0 +1,196 @@ +<!-- + Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. 
See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + --> + +# AWS ECS Executor + +This is an Airflow executor powered by Amazon Elastic Container Service (ECS). Each task that Airflow schedules for execution is run within its own ECS container. Some benefits of an executor like this include: + +1. Task isolation: No task can be a noisy neighbor for another. Resources like CPU, memory, and disk are isolated to each individual task. Any actions or failures which affect networking or fail the entire container only affect the single task running in it. No single user can overload the environment by triggering too many tasks, because there are no shared workers. +2. Customized environments: You can build different container images which incorporate specific dependencies (such as system-level dependencies), binaries, or data required for a task to run. +3. Cost effective: Compute resources only exist for the lifetime of the Airflow task itself. This saves costs by not requiring persistent/long-lived workers ready at all times, which also need maintenance and patching. + +For a quick start guide, see [here](Setup_guide.md); it will get you up and running with a basic configuration. + +The sections below provide more general details about configuration, the provided example Dockerfile, and logging.
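As a sketch of the environment-variable form described in the Config Options section below (all values here are placeholders), the required options could be exported wherever the scheduler runs:

```shell
# Required ECS executor options in the AIRFLOW__AWS_ECS_EXECUTOR__<OPTION_NAME> format
export AIRFLOW__AWS_ECS_EXECUTOR__CLUSTER="my-ecs-cluster"
export AIRFLOW__AWS_ECS_EXECUTOR__CONTAINER_NAME="myEcsContainer"
export AIRFLOW__AWS_ECS_EXECUTOR__REGION="us-east-1"
```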
+ +## Config Options + +There are a number of configuration options available, which can either be set directly in the airflow.cfg +file under an "aws_ecs_executor" section or via environment variables using the `AIRFLOW__AWS_ECS_EXECUTOR__<OPTION_NAME>` +format, for example `AIRFLOW__AWS_ECS_EXECUTOR__CONTAINER_NAME = "myEcsContainer"`. For more information +on how to set these options, see [Setting Configuration Options](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html). + +In the case of conflicts, the order of precedence is: + +1. Load default values for options which have defaults. +2. Load any values provided in the RUN_TASK_KWARGS option if one is provided. +3. Load any values explicitly provided through airflow.cfg or environment variables. These are checked with Airflow's config precedence. + +### Required config options: + +- CLUSTER - Name of the Amazon ECS Cluster. Required. +- CONTAINER_NAME - Name of the container that will be used to execute Airflow tasks via the ECS executor. +The container should be specified in the ECS Task Definition. Required. +- REGION - The name of the AWS Region where Amazon ECS is configured. Required. + +### Optional config options: + +- ASSIGN_PUBLIC_IP - Whether to assign a public IP address to the containers launched by the ECS executor. Defaults to "False". +- CONN_ID - The Airflow connection (i.e. credentials) used by the ECS executor to make API calls to AWS ECS. Defaults to "aws_default". +- LAUNCH_TYPE - Launch type can either be 'FARGATE' or 'EC2'. Defaults to "FARGATE". +- PLATFORM_VERSION - The platform version the ECS task uses if the FARGATE launch type is used. Defaults to "LATEST". +- RUN_TASK_KWARGS - A JSON string containing arguments to provide to the ECS `run_task` API. +- SECURITY_GROUPS - Up to 5 comma-separated security group IDs associated with the ECS task. Defaults to the VPC default. +- SUBNETS - Up to 16 comma-separated subnet IDs associated with the ECS task or service.
Defaults to the VPC default.
- TASK_DEFINITION - The family and revision (family:revision) or full ARN of the ECS task definition to run. Defaults to the latest ACTIVE revision.
- MAX_RUN_TASK_ATTEMPTS - The maximum number of times the ECS executor should attempt to run a task.

For a more detailed description of available options, including type hints and examples, see the `config_templates` folder in the Amazon provider package.

## Dockerfile for ECS Executor

An example Dockerfile can be found [here](Dockerfile). It creates an image that can be used in an ECS container to run Airflow tasks using the AWS ECS executor. The image supports AWS CLI/API integration, allowing you to interact with AWS services within your Airflow environment. It also includes options to load DAGs (Directed Acyclic Graphs) from either an S3 bucket or a local folder.

### Base Image

The Docker image is built upon the `apache/airflow:latest` image. See [here](https://hub.docker.com/r/apache/airflow) for more information about the image.

Important note: The Python version in this image must match the Python version on the host/container which is running the Airflow scheduler process (which in turn runs the executor). The Python version of the image can be verified by running the container and printing the Python version as follows:

```
docker run <image_name> python --version
```

Ensure that this version matches the Python version of the host/container which is running the Airflow scheduler process (and thus the ECS executor). Apache Airflow images with specific Python versions can be downloaded from the Dockerhub registry by filtering tags by [Python version](https://hub.docker.com/r/apache/airflow/tags?page=1&name=3.8). For example, the tag `latest-python3.8` specifies that the image will have Python 3.8 installed.

### Prerequisites

Docker must be installed on your system.
Instructions for installing Docker can be found [here](https://docs.docker.com/get-docker/).

### AWS Credentials

The [AWS CLI](https://aws.amazon.com/cli/) is installed within the container, and there are multiple ways to pass AWS authentication information to the container. This guide covers two methods.

The first method is to use the build-time arguments (`aws_access_key_id`, `aws_secret_access_key`, `aws_default_region`, and `aws_session_token`). To pass AWS authentication information using these arguments, use the `--build-arg` option during the Docker build process. For example:

```
docker build -t my-airflow-image \
 --build-arg aws_access_key_id=YOUR_ACCESS_KEY \
 --build-arg aws_secret_access_key=YOUR_SECRET_KEY \
 --build-arg aws_default_region=YOUR_DEFAULT_REGION \
 --build-arg aws_session_token=YOUR_SESSION_TOKEN .
```

Replace `YOUR_ACCESS_KEY`, `YOUR_SECRET_KEY`, `YOUR_SESSION_TOKEN`, and `YOUR_DEFAULT_REGION` with valid AWS credentials.

Alternatively, you can authenticate to AWS using the `~/.aws` folder. See instructions on how to generate this folder [here](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). Uncomment the line in the Dockerfile that copies the `./.aws` folder from your host machine to the container's `/home/airflow/.aws` directory. Keep in mind the Docker build context when copying the `.aws` folder to the container.

### Loading DAGs

There are many ways to load DAGs on the ECS container. This Dockerfile is preconfigured with two possible ways: copying from a local folder, or downloading from an S3 bucket. Other methods of loading DAGs are possible as well.

#### From S3 Bucket

To load DAGs from an S3 bucket, uncomment the entrypoint line in the Dockerfile to synchronize the DAGs from the specified S3 bucket to the `/opt/airflow/dags` directory inside the container.
You can optionally provide `container_dag_path` as a build argument if you want to store the DAGs in a directory other than `/opt/airflow/dags`.

Add `--build-arg s3_url=YOUR_S3_URL` to the docker build command, replacing `YOUR_S3_URL` with the URL of your S3 bucket. Make sure you have the appropriate permissions to read from the bucket.

Note that the following command also passes in AWS credentials as build arguments.

```
docker build -t my-airflow-image \
 --build-arg aws_access_key_id=YOUR_ACCESS_KEY \
 --build-arg aws_secret_access_key=YOUR_SECRET_KEY \
 --build-arg aws_default_region=YOUR_DEFAULT_REGION \
 --build-arg aws_session_token=YOUR_SESSION_TOKEN \
 --build-arg s3_url=YOUR_S3_URL .
```

#### From Local Folder

To load DAGs from a local folder, place your DAG files in a folder within the Docker build context on your host machine, and provide the location of the folder using the `host_dag_path` build argument. By default, the DAGs will be copied to `/opt/airflow/dags`, but this can be changed by passing the `container_dag_path` build-time argument during the Docker build process:

```
docker build -t my-airflow-image --build-arg host_dag_path=./dags_on_host --build-arg container_dag_path=/path/on/container .
```

If you choose to load DAGs into a path other than `/opt/airflow/dags`, the new path will need to be updated in the Airflow config.

#### Mounting a Volume

You can optionally mount a local directory as a volume on the container at run time. This allows you to make changes to files in the mounted directory and have those changes reflected in the container. To do this, run the following command:

```
docker run --volume /abs/path/to/local/dir:/abs/path/to/remote/dir <image_name>
```

Note: Doing this will overwrite the contents of the directory on the container with the contents of the local directory.
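A common use of this volume mount is live DAG editing during development. As an illustrative sketch (the image name `my-airflow-image` and the host path are placeholders), the snippet below only assembles and prints the run command so the host-to-container mapping is explicit:

```
# Assemble and print a docker run command that mounts local DAGs over the
# container's default DAG directory. Image name and host path are placeholders.
HOST_DAGS="$(pwd)/dags"
CONTAINER_DAGS="/opt/airflow/dags"
echo "docker run --volume ${HOST_DAGS}:${CONTAINER_DAGS} my-airflow-image"
```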

### Installing Python Dependencies

This Dockerfile supports installing Python dependencies via `pip` from a `requirements.txt` file. Place your `requirements.txt` file in the same directory as the Dockerfile. If it is in a different location, it can be specified using the `requirements_path` build argument. Keep in mind the Docker build context when copying the `requirements.txt` file. Uncomment the two appropriate lines in the Dockerfile that copy the `requirements.txt` file to the container and run `pip install` to install the dependencies on the container.

### Building Image for ECS Executor

Detailed instructions on using the Docker image you have created via this readme with the ECS executor can be found [here](link_to_how_to_guide).

## Logging

Airflow tasks executed via this executor run in ECS containers within the configured VPC. This means that logs are not directly accessible to the Airflow Webserver, and once containers are stopped after task completion, the logs are permanently lost.

Remote logging can be employed when using the ECS executor to persist your Airflow task logs and make them viewable from the Airflow Webserver.

### Configuring Remote Logging

There are many ways to configure remote logging and several supported destinations. A general overview of Airflow task logging can be found [here](https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/logging-tasks.html). Instructions for configuring S3 remote logging can be found [here](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/logging/s3-task-handler.html) and Cloudwatch remote logging [here](https://airflow.apache.org/docs/apache-airflow-providers-amazon/stable/logging/cloud-watch-task-handlers.html).
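As a minimal sketch of an S3 setup, remote logging can be switched on via environment variables using Airflow's `[logging]` options `remote_logging`, `remote_base_log_folder`, and `remote_log_conn_id`. The bucket path and connection ID below are placeholders, and these settings must be applied both where the Webserver runs and inside the ECS task container:

```
# Enable remote logging to S3. The bucket path and connection ID are
# placeholders; apply these on the Webserver host and in the task container.
export AIRFLOW__LOGGING__REMOTE_LOGGING="True"
export AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER="s3://my-airflow-logs-bucket/logs"
export AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID="aws_default"
```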
Some important things to point out for remote logging in the context of the ECS executor:

- The configuration options for Airflow remote logging must be configured on the host running the Airflow Webserver (so that it can fetch logs from the remote location) as well as within the ECS container running the Airflow tasks (so that it can upload the logs to the remote location). See [here](https://airflow.apache.org/docs/apache-airflow/stable/howto/set-config.html) to read more about how to set Airflow configuration via config file or environment variable exports.
- Adding the Airflow remote logging config to the container can be done in many ways. Some examples include, but are not limited to:

Review Comment:

> I'll add, with KubernetesExecutor, it's assumed that the "deployment manager" handles this themselves in the pod_template_file. We should use a similar approach for our example here.
