mshober commented on PR #34381:
URL: https://github.com/apache/airflow/pull/34381#issuecomment-1796224734

   Hi @o-nikolas. Just noticed this PR today. I'd like to give my feedback on 
this.
   
   I've been using a fork of @aelzeiny's ECS executor for my company, which runs 
about 180k tasks per day on our Airflow environment. We don't use Fargate; it 
is far too slow and expensive for the number of tasks that we run. The original 
ECS Executor was tightly coupled to Fargate, so much of the work I did was 
related to removing the Fargate specifics.
   
   I'd love to use Airflow's official ECS Executor instead of having to support 
my own, but the current implementation is not suitable for my use case (and 
likely others). 
   
   Here are a few features that are must-haves for me:
   
   ### Supporting Capacity Providers
   The ECS Executor in this PR does not support Capacity Providers. It requires 
that users specify a 
[LaunchType](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-launchType),
 and if you're specifying a LaunchType then you cannot specify a Capacity 
Provider. 
   
   Using Capacity Providers is essential for organizations that use EC2-hosted 
ECS Clusters. If I recall correctly, running tasks with the EC2 LaunchType will 
result in launch failures if there is not enough capacity to run them, whereas 
tasks run with Capacity Providers will go into a Provisioning state until there 
is enough capacity available to run the task. Capacity Providers are also very 
cost-efficient, since you can use them to dynamically scale your EC2 instances 
based on demand.
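
   To illustrate the mutual exclusivity, here is a minimal sketch of a helper 
that builds the `run_task` kwargs either way (the function name and the 
`weight` default are my own, not from the PR):

```python
from typing import Optional


def build_run_task_kwargs(cluster: str, task_def: str,
                          capacity_provider: Optional[str] = None) -> dict:
    """Build kwargs for boto3's ecs.run_task.

    launchType and capacityProviderStrategy are mutually exclusive in the
    RunTask API, so launchType is only set when no Capacity Provider is
    requested.
    """
    kwargs = {"cluster": cluster, "taskDefinition": task_def}
    if capacity_provider:
        kwargs["capacityProviderStrategy"] = [
            {"capacityProvider": capacity_provider, "weight": 1}
        ]
    else:
        kwargs["launchType"] = "EC2"
    return kwargs
```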
   
   ### Overriding Additional ECS Task Properties
   The executor config [is scoped to 
overrides.containerOverrides](https://github.com/Joffreybvn/airflow/commit/8518785ea1078552d1d2bffe4543b927f67f030d#diff-127473a028d510b65de4dd84962b31ab703fbd821f9e871dd296b5c835ee3eebR264-R282).
 However, there are relevant properties outside of 
`overrides.containerOverrides` that users may want to change.
   
   For example, our ECS Cluster is actually composed of three Capacity 
Providers: a General-Purpose provider (our cluster's default, running on M7g 
instances), a Memory-Optimized provider (R7g instances), and a 
Compute-Optimized provider (C7g instances). My version of the ECS Executor 
allows users to set the appropriate Capacity Provider via the operator's 
`executor_config` param so that we can run our jobs in the most cost-efficient 
environment.
   
   There are several other properties which Airflow users may want to set at 
the task level, such as:
   - 
[overrides.cpu](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_TaskOverride.html#ECS-Type-TaskOverride-cpu)
   - 
[overrides.memory](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_TaskOverride.html#ECS-Type-TaskOverride-memory)
   - 
[networkConfiguration](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-networkConfiguration)
   - 
[tags](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-tags)
   - 
[placementConstraints](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-placementConstraints)
   - 
[placementStrategy](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-placementStrategy)
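
   One way to support all of these without enumerating each property would be a 
generic deep merge of the operator's `executor_config` into the executor's base 
RunTask kwargs. A minimal sketch (the helper name is hypothetical, not part of 
the PR):

```python
def merge_run_task_kwargs(base: dict, override: dict) -> dict:
    """Recursively merge a per-task override (e.g. from executor_config)
    into the executor's base run_task kwargs, so any RunTask property,
    not just overrides.containerOverrides, can be customized per task."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_run_task_kwargs(merged[key], value)
        else:
            merged[key] = value
    return merged


# e.g. routing one task to a memory-optimized Capacity Provider while
# bumping its memory override, leaving the base config otherwise intact
base = {"cluster": "etl", "overrides": {"cpu": "1024"}}
override = {
    "capacityProviderStrategy": [{"capacityProvider": "memory-optimized"}],
    "overrides": {"memory": "8192"},
}
merged = merge_run_task_kwargs(base, override)
```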
   
   ### Adopting Task Instances 
   This is a must-have for us, as our deployments replace our scheduler 
instances. 
   
   There is a [PR](https://github.com/aelzeiny/airflow-aws-executors/pull/15) 
for that feature in @aelzeiny's executor. I had to make some changes to get 
that working properly. I can assist on a PR for this feature.
   
   ### Increasing Throughput
   The ECS Executor calls the ECS RunTask API sequentially. In our current 
environment, this leads to a maximum throughput of roughly 4 tasks launched per 
second per scheduler instance. This can cause issues for larger Airflow 
environments like my own, for example:
   - During peak times, tasks often spend a long period in the scheduled state 
despite there being available capacity in the environment.
   - Larger values for `max_tis_per_query` can lead to missed heartbeats 
because of how long the executor spends calling the RunTask API.
   
   I haven't had a chance to implement an improvement for this in my own 
executor yet, but my thinking was to incorporate the same `sync_parallelism` 
logic that is currently used for the `CeleryExecutor`.
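
   A rough sketch of that idea: issue the launches from a bounded thread pool 
instead of a sequential loop. `run_task` here is any callable accepting 
RunTask-style kwargs (e.g. a bound boto3 client method); the names and the 
default parallelism are my own:

```python
from concurrent.futures import ThreadPoolExecutor


def launch_in_parallel(run_task, task_kwargs_list, sync_parallelism=4):
    """Issue RunTask calls from a bounded thread pool so a scheduler
    heartbeat isn't blocked on N sequential API round trips."""
    with ThreadPoolExecutor(max_workers=sync_parallelism) as pool:
        futures = [pool.submit(run_task, **kw) for kw in task_kwargs_list]
        # collect results in launch order; exceptions surface here so
        # failed launches can be retried or reported per task
        return [f.result() for f in futures]
```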
   

