mshober commented on PR #34381:
URL: https://github.com/apache/airflow/pull/34381#issuecomment-1796224734

   Hi @o-nikolas. Just noticed this PR today. I'd like to give my feedback on 
this.
   
   I've been using a fork of @aelzeiny's ECS executor for my company, which runs 
about 180k tasks per day on our Airflow environment. We don't use Fargate; it 
is far too slow and expensive for the number of tasks that we run. The original 
ECS Executor was tightly coupled to Fargate, so much of the work I did was 
related to removing the Fargate specifics.
   
   I'd love to use Airflow's official ECS Executor instead of having to support 
my own, but the current implementation is not suitable for my use case (and 
likely others). 
   
   Here are a few features that are must-haves for me:
   
   ### Supporting Capacity Providers
   The ECS Executor in this PR does not support Capacity Providers. It requires 
that users specify a 
[LaunchType](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-launchType),
 and if you're specifying a LaunchType then you cannot specify a Capacity 
Provider. 
   
   Using Capacity Providers is essential for organizations that use EC2-hosted 
ECS Clusters. If I recall correctly, running tasks with the EC2 LaunchType will 
result in launch failures if there is not enough capacity to run them, whereas 
tasks run with Capacity Providers will go into a Provisioning state until there 
is enough capacity available to run the task. Capacity Providers are also very 
cost-efficient, since you can use them to dynamically scale your EC2 instances 
based on demand.
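
   To illustrate the mutual exclusivity, here is a minimal sketch of a helper 
that builds the `run_task` kwargs either way (the function name and the 
`weight` default are my own, not from the PR):

```python
from typing import Optional


def build_run_task_kwargs(cluster: str, task_def: str,
                          capacity_provider: Optional[str] = None) -> dict:
    """Build kwargs for boto3's ecs.run_task.

    launchType and capacityProviderStrategy are mutually exclusive in the
    RunTask API, so launchType is only set when no Capacity Provider is
    requested.
    """
    kwargs = {"cluster": cluster, "taskDefinition": task_def}
    if capacity_provider:
        kwargs["capacityProviderStrategy"] = [
            {"capacityProvider": capacity_provider, "weight": 1}
        ]
    else:
        kwargs["launchType"] = "EC2"
    return kwargs
```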
   
   ### Overriding Additional ECS Task Properties
   The executor config [is scoped to 
overrides.containerOverrides](https://github.com/Joffreybvn/airflow/commit/8518785ea1078552d1d2bffe4543b927f67f030d#diff-127473a028d510b65de4dd84962b31ab703fbd821f9e871dd296b5c835ee3eebR264-R282).
 However, there are relevant properties outside of 
`overrides.containerOverrides` that users may want to change.
   
   For example, our ECS Cluster is actually composed of three Capacity 
Providers: a General-Purpose provider (our cluster's default, running on M7g 
instances), a Memory-Optimized provider (R7g instances), and a 
Compute-Optimized provider (C7g instances). My version of the ECS Executor 
allows users to set the appropriate Capacity Provider via the operator's 
`executor_config` param so that we can run our jobs in the most cost-efficient 
environment.
   
   There are several other properties which Airflow users may want to set at 
the task level, such as:
   - 
[overrides.cpu](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_TaskOverride.html#ECS-Type-TaskOverride-cpu)
   - 
[overrides.memory](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_TaskOverride.html#ECS-Type-TaskOverride-memory)
   - 
[networkConfiguration](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-networkConfiguration)
   - 
[tags](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-tags)
   - 
[placementConstraints](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-placementConstraints)
   - 
[placementStrategy](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-placementStrategy)
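
   One way to support all of these without enumerating each property would be a 
generic deep merge of the operator's `executor_config` into the executor's base 
RunTask kwargs. A minimal sketch (the helper name is hypothetical, not part of 
the PR):

```python
def merge_run_task_kwargs(base: dict, override: dict) -> dict:
    """Recursively merge a per-task override (e.g. from executor_config)
    into the executor's base run_task kwargs, so any RunTask property,
    not just overrides.containerOverrides, can be customized per task."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_run_task_kwargs(merged[key], value)
        else:
            merged[key] = value
    return merged


# e.g. routing one task to a memory-optimized Capacity Provider while
# bumping its memory override, leaving the base config otherwise intact
base = {"cluster": "etl", "overrides": {"cpu": "1024"}}
override = {
    "capacityProviderStrategy": [{"capacityProvider": "memory-optimized"}],
    "overrides": {"memory": "8192"},
}
merged = merge_run_task_kwargs(base, override)
```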
   
   ### Adopting Task Instances 
   This is a must-have for us, as our deployments replace our scheduler 
instances. 
   
   There is a [PR](https://github.com/aelzeiny/airflow-aws-executors/pull/15) 
for that feature in @aelzeiny's executor. I had to make some changes to get 
that working properly. I can assist on a PR for this feature.
   
   ### Increasing Throughput
   The ECS Executor calls the ECS RunTask API sequentially. In our current 
environment, this leads to a maximum throughput of roughly 4 tasks launched per 
second per scheduler instance. This can cause issues for larger Airflow 
environments like my own, for example:
   - During peak times, tasks often spend a long period in the scheduled state 
despite there being available capacity in the environment.
   - Larger values for `max_tis_per_query` can lead to missed heartbeats 
because of how long the executor spends calling the RunTask API.
   
   I haven't had a chance to implement an improvement for this in my own 
executor yet, but my thinking was to incorporate the same `sync_parallelism` 
logic that is currently used for the `CeleryExecutor`.
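
   A rough sketch of that idea: issue the launches from a bounded thread pool 
instead of a sequential loop. `run_task` here is any callable accepting 
RunTask-style kwargs (e.g. a bound boto3 client method); the names and the 
default parallelism are my own:

```python
from concurrent.futures import ThreadPoolExecutor


def launch_in_parallel(run_task, task_kwargs_list, sync_parallelism=4):
    """Issue RunTask calls from a bounded thread pool so a scheduler
    heartbeat isn't blocked on N sequential API round trips."""
    with ThreadPoolExecutor(max_workers=sync_parallelism) as pool:
        futures = [pool.submit(run_task, **kw) for kw in task_kwargs_list]
        # collect results in launch order; exceptions surface here so
        # failed launches can be retried or reported per task
        return [f.result() for f in futures]
```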
   

