olivermeyer opened a new issue #16299: URL: https://github.com/apache/airflow/issues/16299
**Apache Airflow version**: 1.10.15 (also applies to 2.X)

**Environment**:
- **Cloud provider or hardware configuration**: Astronomer deployment

**What happened**:

I am currently upgrading an Airflow deployment from 1.10.15 to 2.1.0. As part of the upgrade I switched from `airflow.contrib.operators.sagemaker_training_operator.SageMakerTrainingOperator` to `airflow.providers.amazon.aws.operators.sagemaker_training.SageMakerTrainingOperator`, and found that some DAGs started failing at every run afterwards.

I dug into the issue and found that the problem comes from the operator listing existing training jobs ([here](https://github.com/apache/airflow/blob/db63de626f53c9e0242f0752bb996d0e32ebf6ea/airflow/providers/amazon/aws/operators/sagemaker_training.py#L95)). This method calls `boto3`'s `list_training_jobs` over and over again, enough times for a single operator to get rate limited if the number of existing training jobs is high enough. AWS does not allow deleting existing jobs, so these can easily number in the hundreds, if not more. Since the operator does not allow passing `max_results` to the hook's method (although the method can take it), the default page size (10) is used and the number of requests can explode (see the pagination sketch below).

With a single operator, I was able to mitigate the issue by switching from the legacy retry handler to the standard retry handler (see the [boto3 retries documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html) and the config sketch below). However, even the standard retry handler does not help in our case, where a dozen operators fire at the same time: all of them got rate limited again, and I was unable to make the jobs succeed.

One technical fix would be to use a dedicated pool with 1 slot, thereby effectively running all training jobs sequentially (see the pool sketch below). However, that won't do in the real world: SageMaker jobs are often long-running, and we cannot afford to go from 1-2 hours when executing them in parallel to 10-20 hours when executing them sequentially.

I believe (and this is somewhat opinionated) that the `SageMakerTrainingOperator` should **not** be responsible for renaming jobs, for two reasons:
1. Single responsibility principle: my operator should trigger a SageMaker job, not figure out the correct name and then trigger it.
2. Alignment between operators and the systems they interact with: until AWS allows old jobs to somehow be deleted and/or dramatically increases the rate limits, running this list operation is not aligned with the way AWS works.

**What you expected to happen**:

The `SageMakerTrainingOperator` should not be limited in parallelism by the number of existing training jobs in AWS. This limitation is a side effect of listing existing training jobs; therefore, the `SageMakerTrainingOperator` should not list existing training jobs.
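To illustrate why the request count explodes, here is a minimal sketch in plain `boto3` (not the operator itself) of the pagination the hook effectively performs; the `"my-training-job"` prefix is a hypothetical example:

```python
import boto3

client = boto3.client("sagemaker")

# With the default page size (10), listing all jobs whose name contains
# the prefix costs roughly ceil(N / 10) ListTrainingJobs calls, and each
# concurrently running operator pays that cost independently.
kwargs = {"NameContains": "my-training-job"}  # hypothetical prefix
summaries = []
while True:
    response = client.list_training_jobs(**kwargs)
    summaries.extend(response["TrainingJobSummaries"])
    if "NextToken" not in response:
        break
    kwargs["NextToken"] = response["NextToken"]

# Passing MaxResults=100 would cut the call count ~10x, but the operator
# currently provides no way to forward max_results to the hook.
```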
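For the retry mitigation, this is roughly the configuration I used; a sketch assuming the client is built directly with a `botocore` `Config` (the same can be achieved without code changes via the `AWS_RETRY_MODE` and `AWS_MAX_ATTEMPTS` environment variables, or in `~/.aws/config`):

```python
import boto3
from botocore.config import Config

# "standard" mode retries throttling errors with exponential backoff,
# unlike the legacy handler; max_attempts caps the total attempts.
retry_config = Config(retries={"mode": "standard", "max_attempts": 10})

client = boto3.client("sagemaker", config=retry_config)
```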
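And for completeness, the pool workaround I mentioned; a sketch assuming a pool named `sagemaker` with a single slot has been created beforehand (via the UI, or with `airflow pools set sagemaker 1 "serialize SageMaker jobs"` on 2.x):

```python
from airflow.providers.amazon.aws.operators.sagemaker_training import (
    SageMakerTrainingOperator,
)

# Hypothetical, truncated SageMaker training config for illustration only.
training_config = {"TrainingJobName": "my-training-job"}

# Tasks in a 1-slot pool run one at a time, so only one ListTrainingJobs
# scan is in flight at once; the price is fully serial training.
train = SageMakerTrainingOperator(
    task_id="train_model",  # hypothetical task
    config=training_config,
    pool="sagemaker",
)
```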
**Anything else we need to know**:

<details><summary>x.log</summary>

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/sagemaker_training.py", line 97, in execute
    training_jobs = self.hook.list_training_jobs(name_contains=training_job_name)
  File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/sagemaker.py", line 888, in list_training_jobs
    list_training_jobs_request, "TrainingJobSummaries", max_results=max_results
  File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/sagemaker.py", line 945, in _list_request
    response = partial_func(**kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 337, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 656, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (ThrottlingException) when calling the ListTrainingJobs operation (reached max retries: 9): Rate exceeded
```

</details>
