[
https://issues.apache.org/jira/browse/HADOOP-17377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Brandon updated HADOOP-17377:
-----------------------------
Description:
*Summary*
The instance metadata service has its own guidance for error handling and
retry, which differs from the Blob store's.
[https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#error-handling]
In particular, it responds with HTTP 429 when the request rate is too high,
whereas the Blob store responds with HTTP 503. The retry policy used accounts
only for the latter, which can result in job instability when running multiple
processes on the same host.
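As an illustrative sketch only (the class and method names here are assumptions, not the actual Hadoop ABFS API), a retry predicate that accounted for both services' throttling signals might look like:

```java
// Hypothetical sketch: a status-code predicate treating both the Blob
// store's 503 and the instance metadata service's 429 as retryable.
public final class RetryableStatus {
    private RetryableStatus() {}

    public static boolean isRetryable(int httpStatus) {
        return httpStatus == 429                        // IMDS throttling
            || httpStatus == 503                        // Blob store throttling
            || (httpStatus >= 500 && httpStatus < 600); // other server errors
    }
}
```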
*Environment*
* Spark talking to an ABFS store
* Hadoop 3.2.1
* Running on an Azure VM with user-assigned identity, ABFS configured to use
MsiTokenProvider
* 6 executor processes on each VM
*Example*
Here's an example error message and stack trace. It's always the same stack
trace. This appears in my logs a few hundred to low thousands of times a day.
{noformat}
AADToken: HTTP connection failed for getting token from AzureAD. Http response:
429 null
Content-Type: application/json; charset=utf-8 Content-Length: 90 Request ID:
Proxies: none
First 1K of Body: {"error":"invalid_request","error_description":"Temporarily
throttled, too many requests"}
at
org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:190)
at
org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:125)
at
org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:506)
at
org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:489)
at
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:208)
at
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:473)
at
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:437)
at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1717)
at org.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:747)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:724)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:496)
at
org.apache.spark.executor.Executor.$anonfun$updateDependencies$7(Executor.scala:812)
at
org.apache.spark.executor.Executor.$anonfun$updateDependencies$7$adapted(Executor.scala:803)
at
scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:792)
at
scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:791)
at
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:803)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:375)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
CC [~mackrorysd], [[email protected]]
was:
*Summary*
The MSI token provider fetches auth tokens from the local instance metadata
service.
The instance metadata service documentation states a limit of 5 requests per
second, which is fairly low:
[https://docs.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service#error-and-debugging]
When using ABFS with the MSI token provider, especially when there are multiple
JVMs running on the same host, ABFS frequently throws an HTTP 429 throttled
exception.
The implementation for fetching a token from MSI uses ExponentialRetryPolicy;
however, from my read of the code, ExponentialRetryPolicy does not retry on
status code 429.
Perhaps ExponentialRetryPolicy could retry HTTP 429 errors? I'm not sure
what other ramifications that would have.
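For illustration, a minimal exponential-backoff decision in the spirit of ExponentialRetryPolicy, extended to treat 429 as retryable. All names and constants below are assumptions for the sketch, not the actual Hadoop implementation:

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical backoff sketch: retries HTTP 429 in addition to 5xx,
// with exponential delay growth capped at a maximum, plus full jitter.
public final class BackoffSketch {
    static final int MAX_RETRIES = 5;        // illustrative constants
    static final long BASE_DELAY_MS = 500;
    static final long MAX_DELAY_MS = 30_000;

    static boolean shouldRetry(int statusCode, int retryCount) {
        boolean retryable = statusCode == 429
            || (statusCode >= 500 && statusCode < 600);
        return retryable && retryCount < MAX_RETRIES;
    }

    static long delayMs(int retryCount) {
        // Exponential growth (2^n), capped, then full jitter in [0, capped].
        long exp = BASE_DELAY_MS << Math.min(retryCount, 10);
        long capped = Math.min(exp, MAX_DELAY_MS);
        return ThreadLocalRandom.current().nextLong(capped + 1);
    }
}
```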
*Environment*
This is in the context of Spark clusters running on Azure Virtual Machine
Scale Sets. The Virtual Machine Scale Set is configured with a user-assigned
identity. The Spark cluster is configured to download application JARs from an
`abfs://` path, and auth to the storage account with the MSI token provider.
The Spark version is 2.4.4 and the Hadoop libraries are version 3.2.1. More
details on the Spark configuration: each VM runs 6 executor processes, and each
executor process uses 5 cores. The FileSystem objects are singletons within
each JVM due to the internal cache, so on each VM I expect my setup makes 6
rapid requests to the instance metadata service as the executors start up and
fetch the JAR.
*Impact*
In my particular use case, the download operation itself is wrapped with 3
additional retries. I have never seen the download exhaust all of its tries
and fail; in the end, the throttling mostly contributes noise and slowness
from the retries. Still, handling HTTP 429 robustly in the ABFS implementation
would help application developers succeed and write cleaner code without
wrapping individual ABFS operations in retries.
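The application-level workaround described above (wrapping the download in a few extra tries) can be sketched roughly as follows; the helper and its parameters are hypothetical, not part of ABFS or Spark:

```java
import java.util.concurrent.Callable;

// Hypothetical application-level retry wrapper, in the spirit of the
// "3 additional retries" around the download described above.
public final class RetryWrapper {
    public static <T> T withRetries(Callable<T> op, int maxTries, long sleepMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxTries; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                if (attempt < maxTries) {
                    Thread.sleep(sleepMs * attempt); // simple linear backoff
                }
            }
        }
        throw last;
    }
}
```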
*Example*
Here's an example error message and stack trace. It's always the same stack
trace. This appears in my logs a few hundred to low thousands of times a day.
{noformat}
AADToken: HTTP connection failed for getting token from AzureAD. Http response:
429 null
Content-Type: application/json; charset=utf-8 Content-Length: 90 Request ID:
Proxies: none
First 1K of Body: {"error":"invalid_request","error_description":"Temporarily
throttled, too many requests"}
at
org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:190)
at
org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:125)
at
org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:506)
at
org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:489)
at
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:208)
at
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:473)
at
org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:437)
at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1717)
at org.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:747)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:724)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:496)
at
org.apache.spark.executor.Executor.$anonfun$updateDependencies$7(Executor.scala:812)
at
org.apache.spark.executor.Executor.$anonfun$updateDependencies$7$adapted(Executor.scala:803)
at
scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:792)
at
scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
at
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:791)
at
org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:803)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:375)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748){noformat}
CC [~mackrorysd], [[email protected]]
> ABFS: MsiTokenProvider doesn't retry HTTP 429 from the Instance Metadata
> Service
> --------------------------------------------------------------------------------
>
> Key: HADOOP-17377
> URL: https://issues.apache.org/jira/browse/HADOOP-17377
> Project: Hadoop Common
> Issue Type: Bug
> Components: fs/azure
> Affects Versions: 3.2.1
> Reporter: Brandon
> Priority: Major
>
> *Summary*
> The instance metadata service has its own guidance for error handling and
> retry, which differs from the Blob store's.
> [https://docs.microsoft.com/en-us/azure/active-directory/managed-identities-azure-resources/how-to-use-vm-token#error-handling]
>
> In particular, it responds with HTTP 429 when the request rate is too high,
> whereas the Blob store responds with HTTP 503. The retry policy used accounts
> only for the latter, which can result in job instability when running
> multiple processes on the same host.
> *Environment*
> * Spark talking to an ABFS store
> * Hadoop 3.2.1
> * Running on an Azure VM with user-assigned identity, ABFS configured to use
> MsiTokenProvider
> * 6 executor processes on each VM
> *Example*
> Here's an example error message and stack trace. It's always the same stack
> trace. This appears in my logs a few hundred to low thousands of times a day.
> {noformat}
> AADToken: HTTP connection failed for getting token from AzureAD. Http
> response: 429 null
> Content-Type: application/json; charset=utf-8 Content-Length: 90 Request ID:
> Proxies: none
> First 1K of Body: {"error":"invalid_request","error_description":"Temporarily
> throttled, too many requests"}
> at
> org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:190)
> at
> org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:125)
> at
> org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:506)
> at
> org.apache.hadoop.fs.azurebfs.services.AbfsClient.getAclStatus(AbfsClient.java:489)
> at
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getIsNamespaceEnabled(AzureBlobFileSystemStore.java:208)
> at
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:473)
> at
> org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:437)
> at org.apache.hadoop.fs.FileSystem.isFile(FileSystem.java:1717)
> at org.apache.spark.util.Utils$.fetchHcfsFile(Utils.scala:747)
> at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:724)
> at org.apache.spark.util.Utils$.fetchFile(Utils.scala:496)
> at
> org.apache.spark.executor.Executor.$anonfun$updateDependencies$7(Executor.scala:812)
> at
> org.apache.spark.executor.Executor.$anonfun$updateDependencies$7$adapted(Executor.scala:803)
> at
> scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:792)
> at
> scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149)
> at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
> at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
> at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
> at scala.collection.mutable.HashMap.foreach(HashMap.scala:149)
> at
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:791)
> at
> org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:803)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:375)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748){noformat}
> CC [~mackrorysd], [[email protected]]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]