dimas-b commented on code in PR #3113:
URL: https://github.com/apache/polaris/pull/3113#discussion_r2550535740


##########
polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java:
##########
@@ -312,16 +314,83 @@ private void validateAccountAndContainer(
         });
   }
 
+  /**
+   * Fetches an Azure access token with timeout and retry logic to handle 
transient failures.
+   *
+   * <p>This method implements a defensive strategy against slow or failing 
token requests:
+   *
+   * <ul>
+   *   <li>15-second timeout per individual request attempt
+   *   <li>Exponential backoff retry (3 attempts: 2s, 4s, 8s) with 50% jitter
+   *   <li>90-second overall timeout as a safety net
+   * </ul>
+   *
+   * @param tenantId the Azure tenant ID
+   * @return the access token
+   * @throws RuntimeException if token fetch fails after all retries or times 
out
+   */
   private AccessToken getAccessToken(String tenantId) {
     String scope = "https://storage.azure.com/.default";;
     AccessToken accessToken =
         defaultAzureCredential
             .getToken(new 
TokenRequestContext().addScopes(scope).setTenantId(tenantId))
-            .blockOptional()
+            .timeout(Duration.ofSeconds(15)) // Per-attempt timeout
+            .doOnError(
+                error ->
+                    LOGGER.warn(
+                        "Error fetching Azure access token for tenant {}: {}",
+                        tenantId,
+                        error.getMessage()))
+            .retryWhen(
+                Retry.backoff(3, Duration.ofSeconds(2)) // 3 retries: 2s, 4s, 
8s
+                    .jitter(0.5) // ±50% jitter to prevent thundering herd
+                    .filter(
+                        throwable ->
+                            throwable instanceof 
java.util.concurrent.TimeoutException

Review Comment:
   `TimeoutException` is already handled by `isRetriableAzureException`, right?



##########
polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java:
##########
@@ -312,16 +314,83 @@ private void validateAccountAndContainer(
         });
   }
 
+  /**
+   * Fetches an Azure access token with timeout and retry logic to handle 
transient failures.
+   *
+   * <p>This method implements a defensive strategy against slow or failing 
token requests:
+   *
+   * <ul>
+   *   <li>15-second timeout per individual request attempt
+   *   <li>Exponential backoff retry (3 attempts: 2s, 4s, 8s) with 50% jitter
+   *   <li>90-second overall timeout as a safety net
+   * </ul>
+   *
+   * @param tenantId the Azure tenant ID
+   * @return the access token
+   * @throws RuntimeException if token fetch fails after all retries or times 
out
+   */
   private AccessToken getAccessToken(String tenantId) {
     String scope = "https://storage.azure.com/.default";;
     AccessToken accessToken =
         defaultAzureCredential
             .getToken(new 
TokenRequestContext().addScopes(scope).setTenantId(tenantId))
-            .blockOptional()
+            .timeout(Duration.ofSeconds(15)) // Per-attempt timeout

Review Comment:
   We have `RealmConfig` here, could you add a general setting in 
`FeatureConfiguration` for this timeout. I suppose it could applicable to other 
integrations too (but, of course, in this PR we can concentrate on Azure only)



##########
polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java:
##########
@@ -312,16 +314,83 @@ private void validateAccountAndContainer(
         });
   }
 
+  /**
+   * Fetches an Azure access token with timeout and retry logic to handle 
transient failures.
+   *
+   * <p>This method implements a defensive strategy against slow or failing 
token requests:
+   *
+   * <ul>
+   *   <li>15-second timeout per individual request attempt
+   *   <li>Exponential backoff retry (3 attempts: 2s, 4s, 8s) with 50% jitter
+   *   <li>90-second overall timeout as a safety net
+   * </ul>
+   *
+   * @param tenantId the Azure tenant ID
+   * @return the access token
+   * @throws RuntimeException if token fetch fails after all retries or times 
out
+   */
   private AccessToken getAccessToken(String tenantId) {
     String scope = "https://storage.azure.com/.default";;
     AccessToken accessToken =
         defaultAzureCredential
             .getToken(new 
TokenRequestContext().addScopes(scope).setTenantId(tenantId))
-            .blockOptional()
+            .timeout(Duration.ofSeconds(15)) // Per-attempt timeout
+            .doOnError(
+                error ->
+                    LOGGER.warn(
+                        "Error fetching Azure access token for tenant {}: {}",
+                        tenantId,
+                        error.getMessage()))
+            .retryWhen(
+                Retry.backoff(3, Duration.ofSeconds(2)) // 3 retries: 2s, 4s, 
8s

Review Comment:
   Having backoff settings configurable could also be helpful.



##########
polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java:
##########
@@ -312,16 +314,83 @@ private void validateAccountAndContainer(
         });
   }
 
+  /**
+   * Fetches an Azure access token with timeout and retry logic to handle 
transient failures.
+   *
+   * <p>This method implements a defensive strategy against slow or failing 
token requests:
+   *
+   * <ul>
+   *   <li>15-second timeout per individual request attempt
+   *   <li>Exponential backoff retry (3 attempts: 2s, 4s, 8s) with 50% jitter
+   *   <li>90-second overall timeout as a safety net
+   * </ul>
+   *
+   * @param tenantId the Azure tenant ID
+   * @return the access token
+   * @throws RuntimeException if token fetch fails after all retries or times 
out
+   */
   private AccessToken getAccessToken(String tenantId) {
     String scope = "https://storage.azure.com/.default";;
     AccessToken accessToken =
         defaultAzureCredential
             .getToken(new 
TokenRequestContext().addScopes(scope).setTenantId(tenantId))
-            .blockOptional()
+            .timeout(Duration.ofSeconds(15)) // Per-attempt timeout
+            .doOnError(
+                error ->
+                    LOGGER.warn(
+                        "Error fetching Azure access token for tenant {}: {}",
+                        tenantId,
+                        error.getMessage()))
+            .retryWhen(
+                Retry.backoff(3, Duration.ofSeconds(2)) // 3 retries: 2s, 4s, 
8s
+                    .jitter(0.5) // ±50% jitter to prevent thundering herd
+                    .filter(
+                        throwable ->
+                            throwable instanceof 
java.util.concurrent.TimeoutException
+                                || isRetriableAzureException(throwable))
+                    .doBeforeRetry(
+                        retrySignal ->
+                            LOGGER.info(
+                                "Retrying Azure token fetch for tenant {} 
(attempt {}/3)",
+                                tenantId,
+                                retrySignal.totalRetries() + 1))
+                    .onRetryExhaustedThrow(
+                        (retryBackoffSpec, retrySignal) ->
+                            new RuntimeException(
+                                String.format(
+                                    "Azure token fetch exhausted after %d 
attempts for tenant %s",
+                                    retrySignal.totalRetries(), tenantId),
+                                retrySignal.failure())))
+            .blockOptional(Duration.ofSeconds(90)) // Maximum total wait time

Review Comment:
   Why do we need this on top of `.timeout()` (line 337)?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to