dimas-b commented on code in PR #3113:
URL: https://github.com/apache/polaris/pull/3113#discussion_r2550535740
##########
polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java:
##########
@@ -312,16 +314,83 @@ private void validateAccountAndContainer(
});
}
+ /**
+ * Fetches an Azure access token with timeout and retry logic to handle
transient failures.
+ *
+ * <p>This method implements a defensive strategy against slow or failing
token requests:
+ *
+ * <ul>
+ * <li>15-second timeout per individual request attempt
+ * <li>Exponential backoff retry (3 attempts: 2s, 4s, 8s) with 50% jitter
+ * <li>90-second overall timeout as a safety net
+ * </ul>
+ *
+ * @param tenantId the Azure tenant ID
+ * @return the access token
+ * @throws RuntimeException if token fetch fails after all retries or times
out
+ */
private AccessToken getAccessToken(String tenantId) {
String scope = "https://storage.azure.com/.default";
AccessToken accessToken =
defaultAzureCredential
.getToken(new
TokenRequestContext().addScopes(scope).setTenantId(tenantId))
- .blockOptional()
+ .timeout(Duration.ofSeconds(15)) // Per-attempt timeout
+ .doOnError(
+ error ->
+ LOGGER.warn(
+ "Error fetching Azure access token for tenant {}: {}",
+ tenantId,
+ error.getMessage()))
+ .retryWhen(
+ Retry.backoff(3, Duration.ofSeconds(2)) // 3 retries: 2s, 4s,
8s
+ .jitter(0.5) // ±50% jitter to prevent thundering herd
+ .filter(
+ throwable ->
+ throwable instanceof
java.util.concurrent.TimeoutException
Review Comment:
`TimeoutException` is already handled by `isRetriableAzureException`, right?
##########
polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java:
##########
@@ -312,16 +314,83 @@ private void validateAccountAndContainer(
});
}
+ /**
+ * Fetches an Azure access token with timeout and retry logic to handle
transient failures.
+ *
+ * <p>This method implements a defensive strategy against slow or failing
token requests:
+ *
+ * <ul>
+ * <li>15-second timeout per individual request attempt
+ * <li>Exponential backoff retry (3 attempts: 2s, 4s, 8s) with 50% jitter
+ * <li>90-second overall timeout as a safety net
+ * </ul>
+ *
+ * @param tenantId the Azure tenant ID
+ * @return the access token
+ * @throws RuntimeException if token fetch fails after all retries or times
out
+ */
private AccessToken getAccessToken(String tenantId) {
String scope = "https://storage.azure.com/.default";
AccessToken accessToken =
defaultAzureCredential
.getToken(new
TokenRequestContext().addScopes(scope).setTenantId(tenantId))
- .blockOptional()
+ .timeout(Duration.ofSeconds(15)) // Per-attempt timeout
Review Comment:
We have `RealmConfig` here, could you add a general setting in
`FeatureConfiguration` for this timeout. I suppose it could applicable to other
integrations too (but, of course, in this PR we can concentrate on Azure only)
##########
polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java:
##########
@@ -312,16 +314,83 @@ private void validateAccountAndContainer(
});
}
+ /**
+ * Fetches an Azure access token with timeout and retry logic to handle
transient failures.
+ *
+ * <p>This method implements a defensive strategy against slow or failing
token requests:
+ *
+ * <ul>
+ * <li>15-second timeout per individual request attempt
+ * <li>Exponential backoff retry (3 attempts: 2s, 4s, 8s) with 50% jitter
+ * <li>90-second overall timeout as a safety net
+ * </ul>
+ *
+ * @param tenantId the Azure tenant ID
+ * @return the access token
+ * @throws RuntimeException if token fetch fails after all retries or times
out
+ */
private AccessToken getAccessToken(String tenantId) {
String scope = "https://storage.azure.com/.default";
AccessToken accessToken =
defaultAzureCredential
.getToken(new
TokenRequestContext().addScopes(scope).setTenantId(tenantId))
- .blockOptional()
+ .timeout(Duration.ofSeconds(15)) // Per-attempt timeout
+ .doOnError(
+ error ->
+ LOGGER.warn(
+ "Error fetching Azure access token for tenant {}: {}",
+ tenantId,
+ error.getMessage()))
+ .retryWhen(
+ Retry.backoff(3, Duration.ofSeconds(2)) // 3 retries: 2s, 4s,
8s
Review Comment:
Having backoff settings configurable could also be helpful.
##########
polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java:
##########
@@ -312,16 +314,83 @@ private void validateAccountAndContainer(
});
}
+ /**
+ * Fetches an Azure access token with timeout and retry logic to handle
transient failures.
+ *
+ * <p>This method implements a defensive strategy against slow or failing
token requests:
+ *
+ * <ul>
+ * <li>15-second timeout per individual request attempt
+ * <li>Exponential backoff retry (3 attempts: 2s, 4s, 8s) with 50% jitter
+ * <li>90-second overall timeout as a safety net
+ * </ul>
+ *
+ * @param tenantId the Azure tenant ID
+ * @return the access token
+ * @throws RuntimeException if token fetch fails after all retries or times
out
+ */
private AccessToken getAccessToken(String tenantId) {
String scope = "https://storage.azure.com/.default";
AccessToken accessToken =
defaultAzureCredential
.getToken(new
TokenRequestContext().addScopes(scope).setTenantId(tenantId))
- .blockOptional()
+ .timeout(Duration.ofSeconds(15)) // Per-attempt timeout
+ .doOnError(
+ error ->
+ LOGGER.warn(
+ "Error fetching Azure access token for tenant {}: {}",
+ tenantId,
+ error.getMessage()))
+ .retryWhen(
+ Retry.backoff(3, Duration.ofSeconds(2)) // 3 retries: 2s, 4s,
8s
+ .jitter(0.5) // ±50% jitter to prevent thundering herd
+ .filter(
+ throwable ->
+ throwable instanceof
java.util.concurrent.TimeoutException
+ || isRetriableAzureException(throwable))
+ .doBeforeRetry(
+ retrySignal ->
+ LOGGER.info(
+ "Retrying Azure token fetch for tenant {}
(attempt {}/3)",
+ tenantId,
+ retrySignal.totalRetries() + 1))
+ .onRetryExhaustedThrow(
+ (retryBackoffSpec, retrySignal) ->
+ new RuntimeException(
+ String.format(
+ "Azure token fetch exhausted after %d
attempts for tenant %s",
+ retrySignal.totalRetries(), tenantId),
+ retrySignal.failure())))
+ .blockOptional(Duration.ofSeconds(90)) // Maximum total wait time
Review Comment:
Why do we need this on top of `.timeout()` (line 337)?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]