flyrain commented on code in PR #3113:
URL: https://github.com/apache/polaris/pull/3113#discussion_r2575839659
##########
polaris-core/src/main/java/org/apache/polaris/core/config/FeatureConfiguration.java:
##########
@@ -438,4 +438,47 @@ public static void enforceFeatureEnabledOrThrow(
"If set to true (default), allow credential vending for external
catalogs. Note this requires ALLOW_EXTERNAL_CATALOG_CREDENTIAL_VENDING to be
true first.")
.defaultValue(true)
.buildFeatureConfiguration();
+
+ public static final FeatureConfiguration<Integer> CLOUD_API_TIMEOUT_MILLIS =
Review Comment:
`CLOUD_API_TIMEOUT_MILLIS` may over-communicate its scope. Would a more
precise name like `STORAGE_API_TIMEOUT_MILLIS` or even
`AZURE_STORAGE_API_TIMEOUT_MILLIS` be clearer?
If s3/GCS/MinIO doesn't have similar configuration or cannot be applied with
the same timeout, I'd suggest to have a prefix `AZURE_` to make it clear.
##########
polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java:
##########
@@ -312,16 +318,103 @@ private void validateAccountAndContainer(
});
}
- private AccessToken getAccessToken(String tenantId) {
+ /**
+ * Fetches an Azure AD access token with timeout and retry logic to handle
transient failures.
+ *
+ * <p>This access token is used internally to obtain a user delegation key
from Azure Storage,
+ * which is then used to generate SAS tokens for client credential vending.
+ *
+ * <p>This method implements a defensive strategy against slow or failing
cloud provider requests:
+ *
+ * <ul>
+ * <li>Per-attempt timeout (configurable via CLOUD_API_TIMEOUT_MILLIS,
default 15000ms)
+ * <li>Exponential backoff retry (configurable count and initial delay via
CLOUD_API_RETRY_COUNT
+ * and CLOUD_API_RETRY_DELAY_MILLIS, defaults: 3 attempts starting at
2000ms)
+ * <li>Jitter to prevent thundering herd (configurable via
CLOUD_API_RETRY_JITTER_FACTOR,
+ * default 0.5 = 50%%)
+ * </ul>
+ *
+ * @param realmConfig the realm configuration to get timeout and retry
settings
+ * @param tenantId the Azure tenant ID
+ * @return the access token
+ * @throws RuntimeException if token fetch fails after all retries or times
out
+ */
+ private AccessToken getAccessToken(RealmConfig realmConfig, String tenantId)
{
+ int timeoutMillis = realmConfig.getConfig(CLOUD_API_TIMEOUT_MILLIS);
+ int retryCount = realmConfig.getConfig(CLOUD_API_RETRY_COUNT);
+ int initialDelayMillis =
realmConfig.getConfig(CLOUD_API_RETRY_DELAY_MILLIS);
+ double jitter = realmConfig.getConfig(CLOUD_API_RETRY_JITTER_FACTOR);
+
String scope = "https://storage.azure.com/.default";
AccessToken accessToken =
defaultAzureCredential
.getToken(new
TokenRequestContext().addScopes(scope).setTenantId(tenantId))
+ .timeout(Duration.ofMillis(timeoutMillis))
+ .doOnError(
+ error ->
+ LOGGER.warn(
+ "Error fetching Azure access token for tenant {}: {}",
+ tenantId,
+ error.getMessage()))
Review Comment:
Can we log the full stack trace instead of `error.getMessage()` only for
better debuggability?
##########
polaris-core/src/main/java/org/apache/polaris/core/storage/azure/AzureCredentialsStorageIntegration.java:
##########
@@ -312,16 +318,103 @@ private void validateAccountAndContainer(
});
}
- private AccessToken getAccessToken(String tenantId) {
+ /**
+ * Fetches an Azure AD access token with timeout and retry logic to handle
transient failures.
+ *
+ * <p>This access token is used internally to obtain a user delegation key
from Azure Storage,
+ * which is then used to generate SAS tokens for client credential vending.
+ *
+ * <p>This method implements a defensive strategy against slow or failing
cloud provider requests:
+ *
+ * <ul>
+ * <li>Per-attempt timeout (configurable via CLOUD_API_TIMEOUT_MILLIS,
default 15000ms)
+ * <li>Exponential backoff retry (configurable count and initial delay via
CLOUD_API_RETRY_COUNT
+ * and CLOUD_API_RETRY_DELAY_MILLIS, defaults: 3 attempts starting at
2000ms)
+ * <li>Jitter to prevent thundering herd (configurable via
CLOUD_API_RETRY_JITTER_FACTOR,
+ * default 0.5 = 50%%)
+ * </ul>
+ *
+ * @param realmConfig the realm configuration to get timeout and retry
settings
+ * @param tenantId the Azure tenant ID
+ * @return the access token
+ * @throws RuntimeException if token fetch fails after all retries or times
out
+ */
+ private AccessToken getAccessToken(RealmConfig realmConfig, String tenantId)
{
+ int timeoutMillis = realmConfig.getConfig(CLOUD_API_TIMEOUT_MILLIS);
+ int retryCount = realmConfig.getConfig(CLOUD_API_RETRY_COUNT);
+ int initialDelayMillis =
realmConfig.getConfig(CLOUD_API_RETRY_DELAY_MILLIS);
+ double jitter = realmConfig.getConfig(CLOUD_API_RETRY_JITTER_FACTOR);
+
String scope = "https://storage.azure.com/.default";
AccessToken accessToken =
defaultAzureCredential
.getToken(new
TokenRequestContext().addScopes(scope).setTenantId(tenantId))
+ .timeout(Duration.ofMillis(timeoutMillis))
+ .doOnError(
+ error ->
+ LOGGER.warn(
+ "Error fetching Azure access token for tenant {}: {}",
+ tenantId,
+ error.getMessage()))
+ .retryWhen(
+ Retry.backoff(retryCount,
Duration.ofMillis(initialDelayMillis))
+ .jitter(jitter) // Apply jitter factor to computed delay
+ .filter(this::isRetriableAzureException)
+ .doBeforeRetry(
+ retrySignal ->
+ LOGGER.info(
+ "Retrying Azure token fetch for tenant {}
(attempt {}/{})",
+ tenantId,
+ retrySignal.totalRetries() + 1,
+ retryCount))
Review Comment:
Should we add 1 to retryCount? Sot it matches the word "attempt" here? The
total attempt number would be 1 + retryCount:
```
int maxAttempts = retryCount + 1;
...
LOGGER.info("Retrying Azure token fetch for tenant {} (attempt {}/{})",
tenantId,
retrySignal.totalRetries() + 1,
maxAttempts);
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]