mshober commented on PR #34381: URL: https://github.com/apache/airflow/pull/34381#issuecomment-1796224734
Hi @o-nikolas. Just noticed this PR today. I'd like to give my feedback on this. I've been using a fork of @aelzeiny's ECS executor for my company which runs about 180k tasks per day on our Airflow environment. We don't use Fargate; it is far too slow and expensive for the amount of tasks that we run. The original ECS Executor was very coupled to Fargate so much of the work I did was related to removing the Fargate specifics. I'd love to use Airflow's official ECS Executor instead of having to support my own, but the current implementation is not suitable for my use case (and likely others). Here's a few features that are must-haves for me: ### Supporting Capacity Providers The ECS Executor in this PR does not support Capacity Providers. It requires that users specify a [LaunchType](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-launchType), and if you're specifying a LaunchType then you cannot specify a Capacity Provider. Using Capacity Providers is essential for organizations that use EC2-hosted ECS Clusters. If I can recall correctly, running tasks with EC2 LaunchType will result in launch failures if there is not enough capacity to run them, whereas tasks ran with Capacity Providers will go into a Provisioning state until there is enough capacity available to run the task. Capacity Providers are also very cost efficient since you can use them to dynamically scale your EC2 instances based on demand. ### Overriding Additional ECS Task Properties The executor config [is scoped to overrides.containerOverrides](https://github.com/Joffreybvn/airflow/commit/8518785ea1078552d1d2bffe4543b927f67f030d#diff-127473a028d510b65de4dd84962b31ab703fbd821f9e871dd296b5c835ee3eebR264-R282). However there are relevant properties outside of `overrides.containerOverrides` that users may want to change. For example, our ECS Cluster is actually composed of 3 capacity providers: A General-Purpose Capacity Provider (which is our cluster's default provider and runs on M7g instances), Memory-Optimized (R7g instances) and Compute-Optimized (C7g instances). My version of the ECS Executor allows users to set the appropriate Capacity Provider via the operator's `executor_config` param so that we can run our jobs in the most cost-efficient environment. There are several other properties which airflow uses may want to set on a task-level, such as: - [overrides.cpu](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_TaskOverride.html#ECS-Type-TaskOverride-cpu) - [overrides.memory](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_TaskOverride.html#ECS-Type-TaskOverride-memory) - [networkConfiguration](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-networkConfiguration) - [tags](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-tags) - [placementConstraints](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-placementConstraints) - [placementStrategy](https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_RunTask.html#ECS-RunTask-request-placementStrategy) ### Adopting Task Instances This is a must-have for us, as our deployments replace our scheduler instances. There is a [PR](https://github.com/aelzeiny/airflow-aws-executors/pull/15) for that feature in @aelzeiny's executor. I had to make some changes to get that working properly. I can assist on a PR for this feature. ### Increasing Throughput The ECS Executor calls the ECS RunTask API sequentially. On our current environment, this leads to a maximum throughput of roughly 4 tasks launched per second per scheduler instance. This can cause issues for larger airflow environments like my own, for example: - During peak times tasks often spend a long period in the scheduled state despite there being available capacity in the environment. - Larger values for `max_tis_per_query` can lead to missed heartbeats from the length of time the Executor is spent calling the RunTask API. I haven't had a chance to implement an improvement for this in my own executor yet, but my thinking was to incorporate the same `sync_parallelism` logic that is currently used for the `CeleryExecutor`. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
