mwylde opened a new issue, #6625:
URL: https://github.com/apache/arrow-rs/issues/6625

   **Describe the bug**
   
   When using object_store on a GKE pod with workload identity credentials, we see a huge volume of requests to the metadata endpoint to refresh the token (this appears in the logs as a stream of "fetching token from metadata server" lines, 1-2 ms apart). This can overload the metadata service, which in turn blocks any further work in our service.
   
   This is caused by the implementation of the TokenCache:
   
   
https://github.com/apache/arrow-rs/blob/a9294d7b06ce230f738c8bef25a1fd9a3b3e095c/object_store/src/client/token.rs#L64-L76
   
   The token cache is supposed to prevent redundant token fetches by reusing a cached token. However, if the cached token is close to expiry (within the min_ttl window), it will attempt to refresh it and then store the new token in the cache.
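   For context, here is a simplified paraphrase of that logic. The type and method names (`TokenCache`, `TemporaryToken`, `get_or_insert_with`) match the real module, but the body below is trimmed and is not the exact source:

```rust
use std::future::Future;
use std::time::{Duration, Instant};

use tokio::sync::Mutex;

/// A token plus the instant at which it stops being valid.
pub struct TemporaryToken<T> {
    pub token: T,
    pub expiry: Option<Instant>,
}

/// Simplified paraphrase of object_store's TokenCache.
pub struct TokenCache<T> {
    cache: Mutex<Option<TemporaryToken<T>>>,
    min_ttl: Duration, // hard-coded to 300 seconds in the real implementation
}

impl<T: Clone + Send> TokenCache<T> {
    pub async fn get_or_insert_with<F, Fut, E>(&self, f: F) -> Result<T, E>
    where
        F: FnOnce() -> Fut + Send,
        Fut: Future<Output = Result<TemporaryToken<T>, E>> + Send,
    {
        let now = Instant::now();
        // The mutex is held across the refresh `await` below, so a slow or
        // hung fetch blocks every other caller of the cache.
        let mut locked = self.cache.lock().await;

        if let Some(cached) = locked.as_ref() {
            match cached.expiry {
                // Reuse the cached token only if it has more than min_ttl left.
                Some(expiry) if expiry.saturating_duration_since(now) > self.min_ttl => {
                    return Ok(cached.token.clone());
                }
                // Tokens without an expiry never refresh.
                None => return Ok(cached.token.clone()),
                // Otherwise fall through and refresh.
                _ => (),
            }
        }

        let fetched = f().await?;
        let token = fetched.token.clone();
        *locked = Some(fetched);
        Ok(token)
    }
}
```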
   
   In this case, what's happening is that the GKE metadata service returns the 
same token for every call up until ~5 minutes before expiry, at which point it 
generates a new token with an expiry of 1 hour. But min_ttl is hard-coded to 5 
minutes (300 seconds).
   
   This creates the potential for a race condition: if a high volume of calls hits the object_store at ~5 minutes before expiry, each of them may:
   1. Lock the mutex
   2. Observe the cached token is near expiry
   3. Get a new token (which is the same as the old token, with the same <5min 
expiry time)
   4. Save that in the cache
   5. Release the mutex lock
   
   which is exactly what we observe in our logs. If enough requests come in, one of them will overload the metadata service and hang while still holding the mutex, preventing any further use of the object_store. For reasons I don't quite understand, these requests never seem to time out, leaving the store stuck until we restart our service.
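   To make the failure mode concrete, here is a small simulation against the paraphrased cache above (assumed to be in the same module). The fake metadata fetch is hypothetical; it mimics the GKE behavior by always handing back the same token with less than min_ttl remaining, so every call to the cache turns into a fetch:

```rust
use std::convert::Infallible;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{Duration, Instant};

// Counts how many times the "metadata server" was hit.
static FETCHES: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for the GKE metadata server: every call returns the
// same token, which already has less than min_ttl (300s) remaining.
async fn fetch_from_fake_metadata_server() -> Result<TemporaryToken<String>, Infallible> {
    FETCHES.fetch_add(1, Ordering::SeqCst);
    Ok(TemporaryToken {
        token: "same-old-token".to_string(),
        expiry: Some(Instant::now() + Duration::from_secs(240)),
    })
}

#[tokio::main]
async fn main() {
    // Constructed directly here; assumes this lives alongside the sketch above.
    let cache = TokenCache {
        cache: tokio::sync::Mutex::new(None),
        min_ttl: Duration::from_secs(300),
    };

    for _ in 0..100 {
        let _ = cache.get_or_insert_with(fetch_from_fake_metadata_server).await;
    }

    // Prints 100: every single call re-fetched, because the "refreshed" token
    // never has more than min_ttl left.
    println!("metadata fetches: {}", FETCHES.load(Ordering::SeqCst));
}
```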
   
   **To Reproduce**
   
   Run a service (for example https://github.com/ArroyoSystems/arroyo) on GKE 
with a workload identity writing to GCS. Make a high volume of parallel 
requests to the object store, wait an hour, see that many requests are made to 
the metadata service.
   
   **Expected behavior**
   
   Only one request should be made to the metadata service each time the token needs to be refreshed.
   
   **Proposed solutions**
   
   A simple fix is to reduce the min_ttl for GCS to <= 4 minutes. However, I think it's dangerous for a generic subsystem like the token cache to rely on the exact token-generation behavior of one provider. A better solution might be an asynchronous refresh process that is kicked off when min_ttl is hit and runs (with appropriate backoff) until it successfully gets a token with expiry > min_ttl. This would also remove the latency impact of fetching tokens within the request itself.
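
   The asynchronous-refresh design is the more complete answer, but as a smaller interim mitigation the cache could also remember when it last fetched and decline to re-fetch a still-valid token more often than some backoff interval. A rough sketch, using the same paraphrased types and assumptions as above (this is illustrative, not a concrete patch):

```rust
use std::future::Future;
use std::time::{Duration, Instant};

use tokio::sync::Mutex;

// Reuses the TemporaryToken type from the first sketch.

pub struct BackoffTokenCache<T> {
    cache: Mutex<Option<(TemporaryToken<T>, Instant)>>, // cached token + time it was fetched
    min_ttl: Duration,
    fetch_backoff: Duration, // minimum spacing between refresh attempts
}

impl<T: Clone + Send> BackoffTokenCache<T> {
    pub async fn get_or_insert_with<F, Fut, E>(&self, f: F) -> Result<T, E>
    where
        F: FnOnce() -> Fut + Send,
        Fut: Future<Output = Result<TemporaryToken<T>, E>> + Send,
    {
        let now = Instant::now();
        let mut locked = self.cache.lock().await;

        if let Some((cached, fetched_at)) = locked.as_ref() {
            match cached.expiry {
                Some(expiry) => {
                    let ttl = expiry.saturating_duration_since(now);
                    // Serve the cached token if it is comfortably fresh, OR if it
                    // is still valid and we already attempted a refresh very
                    // recently. The second arm caps metadata-server traffic even
                    // when every refresh returns the same near-expiry token.
                    let recently_fetched =
                        now.saturating_duration_since(*fetched_at) < self.fetch_backoff;
                    if ttl > self.min_ttl || (!ttl.is_zero() && recently_fetched) {
                        return Ok(cached.token.clone());
                    }
                }
                // Tokens without an expiry never refresh.
                None => return Ok(cached.token.clone()),
            }
        }

        let fetched = f().await?;
        let token = fetched.token.clone();
        *locked = Some((fetched, Instant::now()));
        Ok(token)
    }
}
```

   This bounds the metadata-server traffic even when every refresh keeps returning the same near-expiry token; a background refresh task, as proposed above, would go further by also taking the fetch latency out of the request path.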

