vandonr-amz opened a new pull request, #29245: URL: https://github.com/apache/airflow/pull/29245
I was looking at this code because we got a throttling error in a system test. The error came from the list_ operation on the job names. This boto call is capped at 100 results per call, and the function wanted all the jobs, so we were making repeated calls with `NextToken` set. From what I found online, boto seems to start throwing throttling errors after around 70-80 chained calls like this with the default config (i.e. around 7,000-8,000 jobs). The problem is that SageMaker jobs cannot be "cleaned": they stay there forever, so listing all of them is an increasingly long operation, doomed to fail.

Looking further, it turned out that we didn't _really_ need to list all the jobs. The list was used for two things: checking whether the job name already existed (which can be done in a single `describe` call), and getting a number to append to the job name if needed.

To make this a lot cheaper and not O(nb jobs), I propose that we use a random number instead of a monotonically increasing sequence to rename jobs in case of collision. The sequence was already imperfect, because we counted all jobs, not just the ones with the same base name, so it'd go like:

1. job_name
2. job_name-2
3. other_job_name
4. other_job_name-4
5. job_name-5
6. etc.
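To illustrate the proposal, here is a minimal sketch of collision handling with a random suffix. It is not the code in this PR: `unique_job_name`, the `name_exists` callback, `max_attempts`, and the suffix range are all hypothetical names and values chosen for the example. In practice `name_exists` would presumably wrap a single `describe` call, but that wiring is left out here.

```python
import random
from typing import Callable


def unique_job_name(
    base_name: str,
    name_exists: Callable[[str], bool],  # hypothetical: one existence check per call
    max_attempts: int = 10,              # hypothetical retry cap
) -> str:
    """Return base_name if it is free, else base_name with a random suffix."""
    candidate = base_name
    for _ in range(max_attempts):
        if not name_exists(candidate):
            return candidate
        # Collision: try a random suffix instead of counting existing jobs.
        candidate = f"{base_name}-{random.randint(0, 9_999_999)}"
    raise RuntimeError(f"no free name found for {base_name!r} after {max_attempts} attempts")


if __name__ == "__main__":
    # Stand-in for the remote service: a local set of names already in use.
    taken = {"job_name"}
    print(unique_job_name("job_name", lambda name: name in taken))
    print(unique_job_name("other_job_name", lambda name: name in taken))
```

With this approach the cost is one existence check per attempt, independent of how many jobs have accumulated, at the price of suffixes that no longer reflect creation order.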
