[
https://issues.apache.org/jira/browse/HADOOP-17377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232676#comment-17232676
]
Steve Loughran commented on HADOOP-17377:
-----------------------------------------
[~snvijaya]
> ABFS: Frequent HTTP429 exceptions with MSI token provider
> ---------------------------------------------------------
>
> Key: HADOOP-17377
> URL: https://issues.apache.org/jira/browse/HADOOP-17377
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/azure
> Affects Versions: 3.2.1
> Reporter: Brandon
> Priority: Major
>
> *Summary*
> The MSI token provider fetches auth tokens from the local instance metadata
> service.
> The instance metadata service documentation states a limit of 5 requests per
> second, which is fairly low:
> [https://docs.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service#error-and-debugging]
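> For context, the token provider's request is essentially an HTTP GET against
> the local instance metadata endpoint, roughly as in the sketch below (the
> api-version and resource values are illustrative, not necessarily what ABFS
> sends):
> {code:java}
> import java.io.BufferedReader;
> import java.io.InputStreamReader;
> import java.net.HttpURLConnection;
> import java.net.URL;
>
> // Illustrative sketch of the kind of request the MSI token provider makes to
> // the local instance metadata service. Only works on an Azure VM; the
> // api-version and resource values here are examples.
> public class ImdsTokenRequestSketch {
>   public static void main(String[] args) throws Exception {
>     URL url = new URL("http://169.254.169.254/metadata/identity/oauth2/token"
>         + "?api-version=2018-02-01&resource=https%3A%2F%2Fstorage.azure.com%2F");
>     HttpURLConnection conn = (HttpURLConnection) url.openConnection();
>     conn.setRequestMethod("GET");
>     conn.setRequestProperty("Metadata", "true");  // required by IMDS
>
>     int status = conn.getResponseCode();          // 429 here means throttled
>     System.out.println("HTTP " + status);
>     if (status == 200) {
>       try (BufferedReader in = new BufferedReader(
>           new InputStreamReader(conn.getInputStream()))) {
>         System.out.println(in.readLine());        // JSON with access_token
>       }
>     }
>   }
> }
> {code}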
> Using ABFS with the MSI token provider, especially when multiple JVMs are
> running on the same host, ABFS frequently throws an HTTP 429 (throttled)
> exception. The implementation for fetching a token from MSI uses
> ExponentialRetryPolicy; however, from my read of the code,
> ExponentialRetryPolicy does not retry on status code 429.
> Perhaps ExponentialRetryPolicy could retry HTTP 429 errors? I'm not sure
> what other ramifications that would have.
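> For illustration, here is a minimal standalone sketch of the kind of retry
> check that would treat 429 as retriable. It is not the actual
> ExponentialRetryPolicy code; the class, method, and retry limit here are
> hypothetical, and the real policy's structure may differ.
> {code:java}
> import java.net.HttpURLConnection;
>
> // Hypothetical sketch: a retry predicate that treats HTTP 429 as retriable,
> // alongside the server-error range that a backoff policy typically retries.
> public class RetriableStatusSketch {
>   static final int HTTP_TOO_MANY_REQUESTS = 429;
>   static final int MAX_RETRIES = 5;
>
>   // Returns true if the operation should be retried for this status code.
>   static boolean shouldRetry(int retryCount, int statusCode) {
>     if (retryCount >= MAX_RETRIES) {
>       return false;
>     }
>     return statusCode == HTTP_TOO_MANY_REQUESTS              // throttled
>         || statusCode == HttpURLConnection.HTTP_UNAVAILABLE  // 503
>         || (statusCode >= 500 && statusCode <= 599);         // server errors
>   }
>
>   public static void main(String[] args) {
>     System.out.println(shouldRetry(0, 429)); // true: retry throttled request
>     System.out.println(shouldRetry(0, 403)); // false: auth failures are fatal
>   }
> }
> {code}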
> *Environment*
> This is in the context of Spark clusters running on Azure Virtual Machine
> Scale Sets. The Virtual Machine Scale Set is configured with a user-assigned
> identity. The Spark cluster is configured to download application JARs from
> an `abfs://` path and to authenticate to the storage account with the MSI
> token provider (a sketch of that configuration follows below). The Spark
> version is 2.4.4; the Hadoop libraries are version 3.2.1.
> More details on the Spark configuration: each VM runs 6 executor processes,
> and each executor process uses 5 cores. FileSystem objects are singletons
> within each JVM due to the internal cache, so on each VM I expect my setup
> makes 6 near-simultaneous requests to the instance metadata service when the
> executors start up and fetch the JAR.
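> For reference, the ABFS + MSI configuration described above looks roughly
> like the following sketch. The account, container, and client id values are
> placeholders, and the exact configuration keys may vary with the Hadoop
> version.
> {code:java}
> import java.net.URI;
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
>
> // Rough sketch of enabling the MSI token provider for ABFS; "container",
> // "account", and the client id are placeholder values.
> public class AbfsMsiConfigSketch {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     conf.set("fs.azure.account.auth.type", "OAuth");
>     conf.set("fs.azure.account.oauth.provider.type",
>         "org.apache.hadoop.fs.azurebfs.oauth2.MsiTokenProvider");
>     // Client id of the user-assigned identity attached to the scale set.
>     conf.set("fs.azure.account.oauth2.client.id", "placeholder-client-id");
>
>     // FileSystem instances are cached per JVM, so each executor process
>     // triggers its own token fetch against the instance metadata service.
>     FileSystem fs = FileSystem.get(
>         URI.create("abfs://container@account.dfs.core.windows.net/"), conf);
>     System.out.println(fs.getUri());
>   }
> }
> {code}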
> *Impact*
> In my particular use case, the download operation itself is wrapped in 3
> additional retries, and I have never seen all of those retries exhausted, so
> the download has not actually failed. In the end, the throttling mostly adds
> noise and slowness from the retries. Still, handling HTTP 429 robustly inside
> the ABFS implementation would let application developers write cleaner code
> without wrapping individual ABFS operations in retries.
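> The caller-side workaround looks roughly like the sketch below (hypothetical
> code, not our actual application; the backoff values are arbitrary):
> {code:java}
> import java.io.IOException;
>
> // Hypothetical sketch of wrapping a download in a fixed number of retries,
> // sleeping with exponential backoff between failed attempts.
> public class RetryingDownloadSketch {
>   interface Download {
>     void run() throws IOException;
>   }
>
>   static void downloadWithRetries(Download download, int maxAttempts)
>       throws IOException, InterruptedException {
>     IOException last = null;
>     for (int attempt = 1; attempt <= maxAttempts; attempt++) {
>       try {
>         download.run();
>         return;                                // success
>       } catch (IOException e) {                // e.g. the 429 failure below
>         last = e;
>         if (attempt == maxAttempts) {
>           break;                               // give up after the last try
>         }
>         Thread.sleep(1000L << (attempt - 1));  // 1s, 2s, 4s, ...
>       }
>     }
>     throw last;                                // all attempts exhausted
>   }
> }
> {code}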
> *Example*
> Here's an example error message and stack trace. It's always the same stack
> trace. This appears in my logs a few hundred to low thousands of times a day.
> {noformat}
> AADToken: HTTP connection failed for getting token from AzureAD. Http response: 429 null
> Content-Type: application/json; charset=utf-8 Content-Length: 90 Request ID: Proxies: none
> First 1K of Body: {"error":"invalid_request","error_description":"Temporarily throttled, too many requests"}
> at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:190)
> at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:125)
> at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:506)
> at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:489)
> at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:208)
> at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:473)
> at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:437)
> at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1717)
> at org.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:747)
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:724)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:496)
> at org.apache.spark.executor.Executor.$anonfun$updateDependencies$7(Executor.scala:812)
> at org.apache.spark.executor.Executor.$anonfun$updateDependencies$7$adapted(Executor.scala:803)
> at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:792)
> at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
> at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
> at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
> at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:791)
> at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:803)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:375)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> {noformat}
> CC [~mackrorysd], [[email protected]]