On Mon, Jan 13, 2025 at 3:21 PM Jacob Champion
<jacob.champ...@enterprisedb.com> wrote:
> Next email will discuss the architectural bug that Kashif found.
Okay, here goes. A standard OAuth connection attempt looks like this
(oh, I hope Gmail doesn't mangle it):

  Issuer   User     libpq   Backend
    |        |        |
    |        x -----> x -----> o     [1] Startup Packet
    |        |        |        |
    |        |        x <----- x     [2] OAUTHBEARER Request
    |        |        |        |
    |        |        x -----> x     [3] Parameter Discovery
    |        |        |        |
    |        |        x <----- o     [4] Parameters Stored
    |        |        |
    |        |        |
    |        |        |
    |        |        x -----> o     [5] New Startup Packet
    |        |        |        |
    |        |        x <----- x     [6] OAUTHBEARER Request
    |        |        |        |
    x <----- x <----> x        |
    x <----- x <----> x        |     [7] OAuth Handshake
    x <----- x <----> x        |
    |        |        |        |
    o        |        x -----> x     [8] Send Token
             |        |        |
             x <----- x <----- x     [9] Connection Established
             |        |        |
             x <----> x <----> x
             x <----> x <----> x     [10] Use the DB
             .        .        .
             .        .        .
             .        .        .

When the server first asks for a token via OAUTHBEARER (step 2), the
client doesn't necessarily know what the server's requirements are for
a given user. It uses the rest of the doomed OAUTHBEARER exchange to
store the issuer and scope information in the PGconn (steps 3-4), then
disconnects and sets need_new_connection in PQconnectPoll() so that a
second connection is immediately opened (step 5). When the OAUTHBEARER
mechanism takes control the second time, it has everything it needs to
conduct the login flow with the issuer (step 7). It then sends the
obtained token to establish a connection (steps 8 onward).

The problem is that step 7 consumes the backend's
authentication_timeout. I'm very good at completing these flows
quickly, but if you can't complete the browser prompts in time, you
will simply not be able to log into the server. Which is harsh, to say
the least. (Imagine the pain if the standard psql password prompt
timed out.) DBAs can get around it by increasing the timeout,
obviously, but that doesn't feel very good as a solution.

Last week I looked into a fix where libpq would simply try again with
the stored token if the backend hangs up on it during the handshake,
but I think that would end up making the UX worse. The token
validation on the server side isn't going to be instantaneous, so if
the client completes the token exchange in 59 seconds and sends it to
the backend, there's an excellent chance that the connection will
still be torn down in a way that's indistinguishable from a crash. We
don't want the two sides to fight for time.

So I think what I'm going to need to do is modify v41-0003 to allow
the mechanism to politely hang up the connection while the flow is in
progress. This further decouples the lifetimes of the mechanism and
the async auth -- the async state now has to live outside of the SASL
exchange -- but I think it's probably more architecturally sound. Yell
at me if that sounds unmaintainable or if there's a more obvious fix
I'm missing.

Huge thanks to Kashif for pointing this out!

--Jacob
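P.S. In case a rough sketch helps make the "politely hang up" idea
concrete: the toy below is not the patch, and every name in it is made
up for illustration (none of these are real libpq internals). The only
point it tries to show is the lifetime split -- the async-auth state
hangs off the connection object rather than the SASL exchange, so the
poll loop can drop the server connection while the user works through
the flow and then reconnect once the token is in hand.

    /* Toy model only; every name here is invented for illustration. */
    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    typedef struct
    {
        bool    flow_running;   /* user still working through the browser flow */
        char    token[256];     /* filled in once the issuer hands us a token */
    } AsyncAuthState;

    typedef struct
    {
        bool            server_connected;   /* are we holding a backend open? */
        AsyncAuthState  async;              /* lives in the conn, not the exchange */
    } FakeConn;

    /*
     * One poll step: while the flow is running, make sure we are not
     * sitting on a backend and burning its authentication_timeout.
     * Once the flow finishes, open a fresh connection with the token.
     */
    static void
    poll_step(FakeConn *conn)
    {
        if (conn->async.flow_running)
        {
            if (conn->server_connected)
            {
                printf("flow in progress: politely hanging up on the server\n");
                conn->server_connected = false;
            }
            return;     /* nothing for the server to do yet */
        }

        if (!conn->server_connected)
        {
            printf("flow done: reconnecting and sending token %.8s...\n",
                   conn->async.token);
            conn->server_connected = true;
        }
    }

    int
    main(void)
    {
        FakeConn conn = {0};

        conn.server_connected = true;       /* mid-handshake, around step 6 */
        conn.async.flow_running = true;     /* user is off in the browser */

        poll_step(&conn);                   /* hang up instead of waiting */

        strcpy(conn.async.token, "opaque-bearer-token");
        conn.async.flow_running = false;    /* flow finally completes */

        poll_step(&conn);                   /* new connection, steps 8-9 */
        return 0;
    }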