RahulPrakash96 commented on code in PR #3113:
URL: https://github.com/apache/polaris/pull/3113#discussion_r2554786766


##########
polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java:
##########
@@ -312,16 +314,83 @@ private void validateAccountAndContainer(
         });
   }
 
+  /**
+   * Fetches an Azure access token with timeout and retry logic to handle 
transient failures.
+   *
+   * <p>This method implements a defensive strategy against slow or failing 
token requests:
+   *
+   * <ul>
+   *   <li>15-second timeout per individual request attempt
+   *   <li>Exponential backoff retry (3 attempts: 2s, 4s, 8s) with 50% jitter
+   *   <li>90-second overall timeout as a safety net
+   * </ul>
+   *
+   * @param tenantId the Azure tenant ID
+   * @return the access token
+   * @throws RuntimeException if token fetch fails after all retries or times 
out
+   */
   private AccessToken getAccessToken(String tenantId) {
     String scope = "https://storage.azure.com/.default";;
     AccessToken accessToken =
         defaultAzureCredential
             .getToken(new 
TokenRequestContext().addScopes(scope).setTenantId(tenantId))
-            .blockOptional()
+            .timeout(Duration.ofSeconds(15)) // Per-attempt timeout
+            .doOnError(
+                error ->
+                    LOGGER.warn(
+                        "Error fetching Azure access token for tenant {}: {}",
+                        tenantId,
+                        error.getMessage()))
+            .retryWhen(
+                Retry.backoff(3, Duration.ofSeconds(2)) // 3 retries: 2s, 4s, 
8s
+                    .jitter(0.5) // ±50% jitter to prevent thundering herd
+                    .filter(
+                        throwable ->
+                            throwable instanceof 
java.util.concurrent.TimeoutException
+                                || isRetriableAzureException(throwable))
+                    .doBeforeRetry(
+                        retrySignal ->
+                            LOGGER.info(
+                                "Retrying Azure token fetch for tenant {} 
(attempt {}/3)",
+                                tenantId,
+                                retrySignal.totalRetries() + 1))
+                    .onRetryExhaustedThrow(
+                        (retryBackoffSpec, retrySignal) ->
+                            new RuntimeException(
+                                String.format(
+                                    "Azure token fetch exhausted after %d 
attempts for tenant %s",
+                                    retrySignal.totalRetries(), tenantId),
+                                retrySignal.failure())))
+            .blockOptional(Duration.ofSeconds(90)) // Maximum total wait time

Review Comment:
   Good point! I initially added the overall timeout as a safety net to ensure 
we never block indefinitely, but you're right it's unnecessary. The combination 
of per-attempt timeout and .retryWhen() with exponential backoff already 
provides sufficient protection. Removed it now. Thanks for the feedback! 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to