olivermeyer opened a new issue #16299:
URL: https://github.com/apache/airflow/issues/16299


   **Apache Airflow version**: 1.10.15 (also applies to 2.X)
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: Astronomer deployment
   
   **What happened**:
   
   I am currently upgrading an Airflow deployment from 1.10.15 to 2.1.0. While doing so, I switched over from `airflow.contrib.operators.sagemaker_training_operator.SageMakerTrainingOperator` to `airflow.providers.amazon.aws.operators.sagemaker_training.SageMakerTrainingOperator`, and found that some DAGs started failing at every run after that.
   
   I dug into the issue a bit and found that the problem comes from the operator listing existing training jobs ([here](https://github.com/apache/airflow/blob/db63de626f53c9e0242f0752bb996d0e32ebf6ea/airflow/providers/amazon/aws/operators/sagemaker_training.py#L95)). This method calls `boto3`'s `list_training_jobs` over and over again, enough times to get rate limited with a single operator if the number of existing training jobs is high enough. AWS does not allow deleting existing training jobs, so these can easily number in the hundreds, if not more. Since the operator does not allow passing `max_results` to the hook's method (even though the method accepts it), the default page size of 10 is used and the number of requests can explode. With a single operator, I was able to mitigate the issue by using the standard retry handler instead of the legacy one (see the [boto3 retries doc](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html)). However, even the standard retry handler does not help in our case, where we have a dozen operators firing at the same time: all of them got rate limited again, and I was unable to make the jobs succeed.
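
   For reference, this is roughly what switching to the standard retry mode looks like at the plain boto3 level; how the `Config` (or the equivalent `AWS_RETRY_MODE` / `AWS_MAX_ATTEMPTS` environment variables) reaches the hook depends on your connection setup, so treat this as a sketch:

```python
import boto3
from botocore.config import Config

# "standard" retry mode backs off on ThrottlingException, unlike the default
# "legacy" handler; the same settings can be supplied via AWS_RETRY_MODE=standard
# and AWS_MAX_ATTEMPTS=10 in the environment.
retry_config = Config(retries={"mode": "standard", "max_attempts": 10})

# region_name is just an example; credentials come from the environment.
sagemaker = boto3.client("sagemaker", config=retry_config, region_name="eu-west-1")
```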
   
   One technical fix would be to use a dedicated pool with a single slot, thereby effectively running all training jobs sequentially. However, that won't do in the real world: SageMaker jobs are often long-running, and we cannot afford to go from 1-2 hours when executing them in parallel to 10-20 hours when running them sequentially.
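
   For completeness, the workaround would look something like the sketch below; the pool name is arbitrary and the training config is elided:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.sagemaker_training import SageMakerTrainingOperator

# One-slot pool created beforehand, e.g. with:
#   airflow pools set sagemaker_training 1 "Serialize SageMaker training jobs"

with DAG(
    "sagemaker_training_example",
    start_date=datetime(2021, 6, 1),
    schedule_interval=None,
) as dag:
    train = SageMakerTrainingOperator(
        task_id="train_model",
        config={},  # the CreateTrainingJob request dict, elided here
        pool="sagemaker_training",  # all tasks in this pool run one at a time
    )
```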
   
   I believe (and this is somewhat opinionated) that the `SageMakerTrainingOperator` should **not** be responsible for renaming jobs, for two reasons: (1) the single responsibility principle (my operator should trigger a SageMaker job, not figure out the correct name and then trigger it); (2) alignment between operators and the systems they interact with: until AWS allows deleting old jobs somehow and/or dramatically increases its rate limits, running this list operation is simply not aligned with the way AWS works.
   
   **What you expected to happen**:
   
   The `SageMakerTrainingOperator` should not be limited in parallelism by the number of existing training jobs in AWS. This limitation is a side effect of listing existing training jobs; therefore, the `SageMakerTrainingOperator` should not list existing training jobs.
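
   To make that concrete, here is a rough sketch of the behaviour I would expect, written as a hypothetical subclass; the names are mine and the body is based on my reading of the current `execute`, so details may be off:

```python
from airflow.exceptions import AirflowException
from airflow.providers.amazon.aws.operators.sagemaker_training import SageMakerTrainingOperator


class NoListSageMakerTrainingOperator(SageMakerTrainingOperator):
    """Hypothetical variant: hand the config to SageMaker as-is and let the API
    reject duplicate names, instead of listing existing jobs to rename the new one."""

    def execute(self, context):
        self.preprocess_config()
        response = self.hook.create_training_job(
            self.config,
            wait_for_completion=self.wait_for_completion,
            print_log=self.print_log,
            check_interval=self.check_interval,
            max_ingestion_time=self.max_ingestion_time,
        )
        if response["ResponseMetadata"]["HTTPStatusCode"] != 200:
            raise AirflowException(f"SageMaker training job creation failed: {response}")
        return {"Training": self.hook.describe_training_job(self.config["TrainingJobName"])}
```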
   
   **Anything else we need to know**:
   
   <details><summary>x.log</summary>

   Traceback (most recent call last):
     File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 984, in _run_raw_task
       result = task_copy.execute(context=context)
     File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/operators/sagemaker_training.py", line 97, in execute
       training_jobs = self.hook.list_training_jobs(name_contains=training_job_name)
     File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/sagemaker.py", line 888, in list_training_jobs
       list_training_jobs_request, "TrainingJobSummaries", max_results=max_results
     File "/usr/local/lib/python3.7/site-packages/airflow/providers/amazon/aws/hooks/sagemaker.py", line 945, in _list_request
       response = partial_func(**kwargs)
     File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 337, in _api_call
       return self._make_api_call(operation_name, kwargs)
     File "/usr/local/lib/python3.7/site-packages/botocore/client.py", line 656, in _make_api_call
       raise error_class(parsed_response, operation_name)
   botocore.exceptions.ClientError: An error occurred (ThrottlingException) when calling the ListTrainingJobs operation (reached max retries: 9): Rate exceeded

   </details>

