mwylde opened a new issue, #6625:
URL: https://github.com/apache/arrow-rs/issues/6625

   **Describe the bug**
   
   When using object_store on a GKE pod with workload identity credentials, we see a huge volume of requests to the metadata endpoint to refresh the token (this appears in the logs as a stream of "fetching token from metadata server" lines, 1-2 ms apart). This can overload the metadata service, which in turn blocks any further work in our service.
   
   This is caused by the implementation of the TokenCache:
   
   
https://github.com/apache/arrow-rs/blob/a9294d7b06ce230f738c8bef25a1fd9a3b3e095c/object_store/src/client/token.rs#L64-L76
   
   The token cache is supposed to prevent redundant token fetches by reusing a cached token. However, if the cached token is close to expiry (within the min_ttl window), it will attempt to refresh it and then store the new token in the cache.
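   For context, here is a simplified paraphrase of that logic. The type and method names (`TokenCache`, `TemporaryToken`, `get_or_insert_with`) match the real module, but the body below is trimmed and is not the exact source:

```rust
use std::future::Future;
use std::time::{Duration, Instant};

use tokio::sync::Mutex;

/// A token plus the instant at which it stops being valid.
pub struct TemporaryToken<T> {
    pub token: T,
    pub expiry: Option<Instant>,
}

/// Simplified paraphrase of object_store's TokenCache.
pub struct TokenCache<T> {
    cache: Mutex<Option<TemporaryToken<T>>>,
    min_ttl: Duration, // hard-coded to 300 seconds in the real implementation
}

impl<T: Clone + Send> TokenCache<T> {
    pub async fn get_or_insert_with<F, Fut, E>(&self, f: F) -> Result<T, E>
    where
        F: FnOnce() -> Fut + Send,
        Fut: Future<Output = Result<TemporaryToken<T>, E>> + Send,
    {
        let now = Instant::now();
        // The mutex is held across the refresh `await` below, so a slow or
        // hung fetch blocks every other caller of the cache.
        let mut locked = self.cache.lock().await;

        if let Some(cached) = locked.as_ref() {
            match cached.expiry {
                // Reuse the cached token only if it has more than min_ttl left.
                Some(expiry) if expiry.saturating_duration_since(now) > self.min_ttl => {
                    return Ok(cached.token.clone());
                }
                // Tokens without an expiry never refresh.
                None => return Ok(cached.token.clone()),
                // Otherwise fall through and refresh.
                _ => (),
            }
        }

        let fetched = f().await?;
        let token = fetched.token.clone();
        *locked = Some(fetched);
        Ok(token)
    }
}
```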
   
   In this case, what's happening is that the GKE metadata service returns the 
same token for every call up until ~5 minutes before expiry, at which point it 
generates a new token with an expiry of 1 hour. But min_ttl is hard-coded to 5 
minutes (300 seconds).
   
   This creates the potential for a race condition: if a high volume of calls hits the object_store at ~5 minutes before expiry, each of them may:
   1. Lock the mutex
   2. Observe the cached token is near expiry
   3. Get a new token (which is the same as the old token, with the same <5min 
expiry time)
   4. Save that in the cache
   5. Release the mutex lock
   
   which is exactly what we observe in our logs. If enough requests come in, one of them will overload the metadata service and hang while still holding the mutex, preventing any further use of the object_store. For reasons I don't quite understand, these requests never seem to time out, leaving the store stuck until we restart our service.
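   To make the failure mode concrete, here is a small simulation against the paraphrased cache above (assumed to be in the same module). The fake metadata fetch is hypothetical; it mimics the GKE behavior by always handing back the same token with less than min_ttl remaining, so every call to the cache turns into a fetch:

```rust
use std::convert::Infallible;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::time::{Duration, Instant};

// Counts how many times the "metadata server" was hit.
static FETCHES: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for the GKE metadata server: every call returns the
// same token, which already has less than min_ttl (300s) remaining.
async fn fetch_from_fake_metadata_server() -> Result<TemporaryToken<String>, Infallible> {
    FETCHES.fetch_add(1, Ordering::SeqCst);
    Ok(TemporaryToken {
        token: "same-old-token".to_string(),
        expiry: Some(Instant::now() + Duration::from_secs(240)),
    })
}

#[tokio::main]
async fn main() {
    // Constructed directly here; assumes this lives alongside the sketch above.
    let cache = TokenCache {
        cache: tokio::sync::Mutex::new(None),
        min_ttl: Duration::from_secs(300),
    };

    for _ in 0..100 {
        let _ = cache.get_or_insert_with(fetch_from_fake_metadata_server).await;
    }

    // Prints 100: every single call re-fetched, because the "refreshed" token
    // never has more than min_ttl left.
    println!("metadata fetches: {}", FETCHES.load(Ordering::SeqCst));
}
```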
   
   **To Reproduce**
   
   Run a service (for example https://github.com/ArroyoSystems/arroyo) on GKE 
with a workload identity writing to GCS. Make a high volume of parallel 
requests to the object store, wait an hour, see that many requests are made to 
the metadata service.
   
   **Expected behavior**
   
   Only one request should be made to the metadata service each time the token needs to be refreshed.
   
   **Proposed solutions**
   
   A simple fix is to reduce the min_ttl for GCS to <= 4 minutes. However, I think it's dangerous for a generic subsystem like the token cache to rely on the exact token-generation behavior of one provider. A better solution might be an asynchronous refresh process that is kicked off when min_ttl is hit and runs (with appropriate backoff) until it successfully gets a token with expiry > min_ttl. This would also remove the latency impact of fetching tokens within the request itself.
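
   The asynchronous-refresh design is the more complete answer, but as a smaller interim mitigation the cache could also remember when it last fetched and decline to re-fetch a still-valid token more often than some backoff interval. A rough sketch, using the same paraphrased types and assumptions as above (this is illustrative, not a concrete patch):

```rust
use std::future::Future;
use std::time::{Duration, Instant};

use tokio::sync::Mutex;

// Reuses the TemporaryToken type from the first sketch.

pub struct BackoffTokenCache<T> {
    cache: Mutex<Option<(TemporaryToken<T>, Instant)>>, // cached token + time it was fetched
    min_ttl: Duration,
    fetch_backoff: Duration, // minimum spacing between refresh attempts
}

impl<T: Clone + Send> BackoffTokenCache<T> {
    pub async fn get_or_insert_with<F, Fut, E>(&self, f: F) -> Result<T, E>
    where
        F: FnOnce() -> Fut + Send,
        Fut: Future<Output = Result<TemporaryToken<T>, E>> + Send,
    {
        let now = Instant::now();
        let mut locked = self.cache.lock().await;

        if let Some((cached, fetched_at)) = locked.as_ref() {
            match cached.expiry {
                Some(expiry) => {
                    let ttl = expiry.saturating_duration_since(now);
                    // Serve the cached token if it is comfortably fresh, OR if it
                    // is still valid and we already attempted a refresh very
                    // recently. The second arm caps metadata-server traffic even
                    // when every refresh returns the same near-expiry token.
                    let recently_fetched =
                        now.saturating_duration_since(*fetched_at) < self.fetch_backoff;
                    if ttl > self.min_ttl || (!ttl.is_zero() && recently_fetched) {
                        return Ok(cached.token.clone());
                    }
                }
                // Tokens without an expiry never refresh.
                None => return Ok(cached.token.clone()),
            }
        }

        let fetched = f().await?;
        let token = fetched.token.clone();
        *locked = Some((fetched, Instant::now()));
        Ok(token)
    }
}
```

   This bounds the metadata-server traffic even when every refresh keeps returning the same near-expiry token; a background refresh task, as proposed above, would go further by also taking the fetch latency out of the request path.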

