At the risk of opening this email with a pun: we've invested a bunch of time on both desktop[0] and Android[1] addressing clock skew problems. (And in server-side tests, too: [2].)

Auth, token, and storage requests are all Hawk-authenticated. The Hawk authentication process bakes in a timestamp. That timestamp necessarily comes from the client clock. If the client clock is too far off the server clock (and remember, there are three different servers in our architecture), the request will be rejected because the header is wrong.[3]

The solution we've used for this is skew adjustment. We maintain a skew value for each server, baking in this offset to future requests.

This is part and parcel of Hawk:

---
[4] Hawk uses an interesting mechanism to ensure the clock skews are within the reasonable limits. When the server must fail a request on account of stale timestamp (MAC computed matches with the one in the request but timestamp is outside of the allowable skew), the server sends the timestamp (ts) as per the server clock along with a MAC (tsm) computed using the client credentials, in the WWW-Authenticate response header like so.

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Hawk ts="1353832234", tsm="6G8r5JiE+NLoym+WwjeHzjDNCUtLNIxmo1vpMofpLAE="
---

---
[5] Using a timestamp requires the client's clock to be in sync with the server's clock. Hawk requires both the client clock and the server clock to use NTP to ensure synchronization. However, given the limitations of some client types (e.g. browsers) to deploy NTP, the server provides the client with its current time (in seconds precision) in response to a bad timestamp.

There is no expectation that the client will adjust its system clock to match the server (in fact, this would be a potential attack vector). Instead, the client only uses the server's time to calculate an offset used only for communications with that particular server. The protocol rewards clients with synchronized clocks by reducing the number of round trips required to authenticate the first request.
---

Correct and efficient usage of Hawk is predicated on clients with correct clocks, which seems like an insane assumption to make: at least 3.5% of Android devices have clocks that are incorrect by more than *1 hour*, let alone 1 minute.[6] Network-set Android clocks are also routinely wrong by 15s, which is 25% of the protocol's margin of error.

Failures due to clocks seem incredibly widespread amongst the small set of Mozillians who've given FxA Sync a try. That is disheartening, but not surprising.

We're *requiring* clients to fail frequently in the course of normal operation: on the first request (no known skew yet); on subsequent requests if the clock has been adjusted since the skew was computed; on subsequent syncs if your network changes and your latency shifts (because our skew computation doesn't try to model the network); when the server clock is automatically corrected; etc.

This whole process is fragile, provides a bad user experience (your first sync is almost guaranteed to fail), and on an implementation level it is apparently hard to get right (as the existence of [3], after we've landed our skew handling, demonstrates).

We know we still have low-level work to do: maybe persisting skew values across restarts, doing better at modeling the environment to correct skews, retrying in more places to allow for skew-driven failures. But this seems like a bad choice of investment.

Correcting for skew seems to defeat some of the purpose of this timestamp validation: if you can intercept a request from a client whose clock is wrong in the right direction (that is, running fast), you can save that token and use it later when the timestamp becomes valid, no?

And categorizing a large chunk of requests as routinely erroneous, forcing them into error handling states, seems like a bad idea.

What can we do to mitigate this problem?

Ideas, many of which will no doubt violate the promises that Hawk makes:

* Widen the validity window from 1 minute to 1 hour. Or six hours. Or three days.
* Do something non-conformant, like having clients pass their clock to the server, eliminating the requirement for clients to manage skew.
* Eliminate Hawk entirely, at least for the storage servers, switching the output of the token server to be some kind of short-lived bearer token.
* ???

More input, please!

-R

[0] https://bugzilla.mozilla.org/show_bug.cgi?id=957863
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=962668, https://bugzilla.mozilla.org/show_bug.cgi?id=929066
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=971059#c16
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=971059
[4] http://lbadri.wordpress.com/2013/09/01/Hawk-authentication-for-asp-net-web-api-using-thinktecture-identitymodel-45-replay-protection/
[5] https://www.npmjs.org/package/Hawk
[6] http://opensignal.com/reports/timestamps/

_______________________________________________
Sync-dev mailing list
Sync-dev@mozilla.org
https://mail.mozilla.org/listinfo/sync-dev
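
For anyone who hasn't looked at the client code, the per-server skew adjustment described earlier in this message can be sketched roughly as follows. This is a minimal illustration in Python, not the desktop or Android implementation: the names SkewTracker, now_for, and update_from_401 are made up for this example, and so are the host names in the usage note. Like our current computation, it makes no attempt to model network latency, and it deliberately skips verifying the tsm MAC, which a real client should check before trusting ts.

```python
import re
import time

# Matches the ts="..." attribute but not tsm="..." (hypothetical helper).
_TS_RE = re.compile(r'\bts="(\d+)"')


class SkewTracker:
    """Remembers one clock offset, in whole seconds, per server host."""

    def __init__(self, clock=time.time):
        self._clock = clock      # injectable so tests can use a fake clock
        self._offsets = {}       # host -> (server time - local time)

    def now_for(self, host):
        """Timestamp to place in a Hawk header for `host`.

        Local time plus the known offset; the offset is 0 until a 401
        from that host has told us otherwise.
        """
        return int(self._clock()) + self._offsets.get(host, 0)

    def update_from_401(self, host, www_authenticate):
        """Record the offset implied by a 'Hawk ts=..., tsm=...' header value.

        Returns True if a server timestamp was found and stored.
        NOTE: tsm is not verified here; this sketch trusts ts blindly.
        """
        if not www_authenticate.startswith("Hawk"):
            return False
        match = _TS_RE.search(www_authenticate)
        if match is None:
            return False
        self._offsets[host] = int(match.group(1)) - int(self._clock())
        return True
```

A client would call update_from_401("token.example", header_value) when a request is rejected for a stale timestamp, then retry using now_for("token.example"). Note that the offset lives only in memory and is keyed by host, so a restart, a system clock change, or a different server each cost another failed round trip, which is exactly the fragility complained about above.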