Serhii Nesterov created HADOOP-19620:
----------------------------------------

             Summary: AzureADAuthenticator should be able to retry on UnknownHostException
                 Key: HADOOP-19620
                 URL: https://issues.apache.org/jira/browse/HADOOP-19620
             Project: Hadoop Common
          Issue Type: Improvement
          Components: auth
    Affects Versions: 3.4.1
            Reporter: Serhii Nesterov

When Hadoop is asked to perform operations against ADLS Gen2 storage, `AbfsRestOperation` attempts to obtain an access token from Microsoft. Under the hood, it uses a plain `java.net.HttpURLConnection` HTTP client.

Occasionally, environments run into intermittent network issues, including a DNS-related `UnknownHostException`. Technically, the HTTP client throws an `IOException` whose cause is `UnknownHostException`. `AzureADAuthenticator` in turn catches the `IOException`, sets `httperror = -1`, and then checks whether the error is recoverable and can be retried. The exception is neither an instance of `MalformedURLException` nor an instance of `FileNotFoundException`, and the status is not treated as a recoverable status code (< 100 || == 408 || (>= 500 && != 501 && != 505)), so a retry never occurs. This matters for our project because it causes problems with state recovery.

The final exception stack trace on the client side looks as follows (Apache Spark application):

{code:java}
Job aborted due to stage failure: Task 14 in stage 384.0 failed 4 times, most recent failure: Lost task 14.3 in stage 384.0 (TID 3087) (10.244.91.7 executor 29): Status code: -1 error code: null error message: Auth failure: HTTP Error -1; url='https://login.microsoftonline.com/$TENANT_ID/oauth2/v2.0/token' AzureADAuthenticator.getTokenCall threw java.net.UnknownHostException: login.microsoftonline.com
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.executeHttpOperation(AbfsRestOperation.java:321)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.completeExecute(AbfsRestOperation.java:263)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.lambda$execute$0(AbfsRestOperation.java:235)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.measureDurationOfInvocation(IOStatisticsBinding.java:494)
	at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.trackDurationOfInvocation(IOStatisticsBinding.java:465)
	at org.apache.hadoop.fs.azurebfs.services.AbfsRestOperation.execute(AbfsRestOperation.java:233)
	at org.apache.hadoop.fs.azurebfs.services.AbfsClient.getPathStatus(AbfsClient.java:1099)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.getFileStatus(AzureBlobFileSystemStore.java:1164)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:766)
	at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.getFileStatus(AzureBlobFileSystem.java:756)
	at org.apache.parquet.hadoop.util.HadoopInputFile.fromPath(HadoopInputFile.java:39)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFooterReader.readFooter(ParquetFooterReader.java:39)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$lzycompute$1(ParquetFileFormat.scala:211)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.footerFileMetaData$1(ParquetFileFormat.scala:210)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat.$anonfun$buildReaderWithPartitionValues$2(ParquetFileFormat.scala:213)
	...
{code}

This exception is already treated as recoverable in other parts of the Hadoop project (e.g., `DefaultAMSProcessor`). We would like to have a similar retry mechanism for fetching tokens. Moreover, `AbfsRestOperation` already handles and retries `UnknownHostException`, but that handling seems to apply only to storage communication, not to token retrieval.
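
To illustrate the kind of check requested above, here is a minimal sketch of detecting an `UnknownHostException` anywhere in the cause chain of an `IOException` and retrying only in that case. It is not the code in `AzureADAuthenticator`; the class `TokenRetrySketch`, the methods `causedByUnknownHost` and `fetchWithRetry`, the linear backoff, and the cause-chain depth limit are all hypothetical names and choices made for illustration.

```java
import java.io.IOException;
import java.net.UnknownHostException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Hadoop.
final class TokenRetrySketch {

  private TokenRetrySketch() {
  }

  /** Returns true if t, or any exception in its cause chain, is an UnknownHostException. */
  static boolean causedByUnknownHost(Throwable t) {
    int depth = 0; // assumed limit, guards against cyclic cause chains
    while (t != null && depth++ < 16) {
      if (t instanceof UnknownHostException) {
        return true;
      }
      t = t.getCause();
    }
    return false;
  }

  /**
   * Invokes fetch up to maxAttempts times. Only IOExceptions caused by a
   * DNS failure are retried; everything else is rethrown immediately.
   */
  static <T> T fetchWithRetry(Callable<T> fetch, int maxAttempts, long backoffMillis)
      throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    for (int attempt = 1; ; attempt++) {
      try {
        return fetch.call();
      } catch (IOException e) {
        if (attempt >= maxAttempts || !causedByUnknownHost(e)) {
          throw e;
        }
        // Linear backoff between attempts (an assumption of this sketch).
        Thread.sleep(backoffMillis * attempt);
      }
    }
  }
}
```

Walking the cause chain matters here because, as described in the issue, the HTTP client surfaces the DNS failure as the cause of a wrapping `IOException`, so a check on the outer exception alone would miss it.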