At the risk of opening this email with a pun: we've invested a bunch of time on 
both desktop[0] and Android[1] addressing clock skew problems.

(And in server-side tests, too: [2].)

Auth, token, and storage requests are all Hawk-authenticated. The Hawk 
authentication process bakes in a timestamp. That timestamp necessarily comes 
from the client clock. If the client clock is too far off the server clock (and 
remember, there are three different servers in our architecture), the request 
will be rejected because the timestamp in its header falls outside the allowed 
window.[3]

The solution we've used for this is skew adjustment. We maintain a skew value 
for each server and apply that offset to the timestamps of future requests to 
that server.
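
To make the per-server bookkeeping concrete, here is a minimal sketch in 
Python. It is not our desktop or Android code; the class name SkewTracker, its 
methods, and the host names in the usage note are all hypothetical, and the 
injectable clock exists only so the behaviour can be checked without touching 
the system clock.

```python
import time


class SkewTracker:
    """Tracks a clock offset (in seconds) per server host."""

    def __init__(self, clock=time.time):
        self._clock = clock   # injectable so tests can supply a fake clock
        self._skew = {}       # host -> seconds to add to the local clock

    def update(self, host, server_ts):
        """Record the offset implied by a timestamp the server reported."""
        self._skew[host] = server_ts - self._clock()

    def now(self, host):
        """Local time adjusted by the last known skew for this host."""
        return int(self._clock() + self._skew.get(host, 0))
```

With a local clock reading 1000 and a (made-up) token server reporting 4600, 
later calls to now() for that host return 4600, while a host we have never 
heard from still gets the raw local time of 1000.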

This is part and parcel of Hawk:

--- [4]
Hawk uses an interesting mechanism to ensure the clock skews are within the 
reasonable limits. When the server must fail a request on account of stale 
timestamp (MAC computed matches with the one in the request but timestamp is 
outside of the allowable skew), the server sends the timestamp (ts) as per the 
server clock along with a MAC (tsm) computed using the client credentials, in 
the WWW-Authenticate response header like so.

  HTTP/1.1 401 Unauthorized
  WWW-Authenticate: Hawk ts="1353832234",
                         tsm="6G8r5JiE+NLoym+WwjeHzjDNCUtLNIxmo1vpMofpLAE="
---

--- [5]
Using a timestamp requires the client's clock to be in sync with the server's 
clock. Hawk requires both the client clock and the server clock to use NTP to 
ensure synchronization. However, given the limitations of some client types 
(e.g. browsers) to deploy NTP, the server provides the client with its current 
time (in seconds precision) in response to a bad timestamp.

There is no expectation that the client will adjust its system clock to match 
the server (in fact, this would be a potential attack vector). Instead, the 
client only uses the server's time to calculate an offset used only for 
communications with that particular server. The protocol rewards clients with 
synchronized clocks by reducing the number of round trips required to 
authenticate the first request.
---
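
As a rough illustration of the client side of that exchange, the sketch below 
parses the ts/tsm challenge, checks the tsm, and derives an offset. The 
function names are hypothetical. The string fed to the HMAC ("hawk.1.ts", a 
newline, the timestamp, a newline) and the choice of SHA-256 are my 
understanding of the Node reference implementation, so treat them as 
assumptions to verify against the Hawk source rather than as the spec.

```python
import base64
import hashlib
import hmac
import re


def parse_hawk_challenge(header):
    """Extract key="value" attributes from a 'Hawk ...' WWW-Authenticate value."""
    scheme, _, rest = header.strip().partition(" ")
    if scheme.lower() != "hawk":
        raise ValueError("not a Hawk challenge")
    return dict(re.findall(r'(\w+)="([^"]*)"', rest))


def ts_mac(ts, key, digestmod=hashlib.sha256):
    """MAC over the server timestamp, keyed with the client credentials."""
    # Assumption: normalized string as used by the Node reference implementation.
    msg = "hawk.1.ts\n{}\n".format(ts).encode("ascii")
    return base64.b64encode(hmac.new(key, msg, digestmod).digest()).decode("ascii")


def offset_from_challenge(header, key, local_now):
    """Return server time minus local time, refusing unauthenticated timestamps."""
    attrs = parse_hawk_challenge(header)
    ts, tsm = attrs["ts"], attrs["tsm"]
    if not hmac.compare_digest(ts_mac(ts, key), tsm):
        raise ValueError("tsm mismatch; ignoring server timestamp")
    return int(ts) - int(local_now)
```

Checking the tsm before trusting ts is the point of the exercise: without it, 
anyone able to inject a 401 could push the client's offset wherever they like.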


Correct and efficient usage of Hawk is predicated on clients with correct 
clocks, which seems like an insane assumption to make: at least 3.5% of Android 
devices have clocks that are incorrect by more than *1 hour*, let alone 1 
minute.[6]

Network-set Android clocks are also routinely wrong by 15s, which is 25% of the 
protocol's margin of error.

Failures due to clocks seem incredibly widespread amongst the small set of 
Mozillians who've given FxA Sync a try. That is disheartening, but not 
surprising.


We're *requiring* clients to fail frequently in the course of normal operation: 
on the first request (no known skew yet); on subsequent requests if the clock 
is adjusted since the skew was computed; on subsequent syncs if your network 
changes and your latency shifts (because our skew computation doesn't try to 
model the network); when the server clock is automatically corrected; etc.


This whole process is fragile, provides a bad user experience (your first sync 
is almost guaranteed to fail), and on an implementation level it is apparently 
hard to get right (as the existence of [3], after we've landed our skew 
handling, demonstrates).


We know we still have low-level work to do: maybe persisting skew values across 
restarts, doing better at modeling the environment to correct skews, retrying 
in more places to allow for skew-driven failures.

But this seems like a bad choice of investment. Correcting for skew seems to 
defeat some of the purpose of this timestamp validation: if you can intercept a 
request from a client whose clock is wrong in the right direction, you can save 
that token and use it later when the timestamp becomes valid, no? And 
categorizing a large chunk of requests as routinely erroneous, forcing them 
into error handling states, seems like a bad idea.
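
A toy model of the replay concern, for concreteness. It assumes a symmetric 
60-second window and deliberately ignores nonces and every other check a real 
server performs; is_fresh and the numbers are illustrative only.

```python
WINDOW = 60  # seconds; the one-minute validity window discussed in this thread


def is_fresh(request_ts, server_now, window=WINDOW):
    """Server-side check: accept only timestamps within the window."""
    return abs(server_now - request_ts) <= window


server_now = 1_000_000
client_ts = server_now + 3600  # stamped by a client whose clock runs one hour fast

rejected_now = is_fresh(client_ts, server_now)            # outside the window today
accepted_later = is_fresh(client_ts, server_now + 3600)   # inside it an hour from now
```

The same captured timestamp fails the check at the moment it is sent and 
passes it an hour later, which is the situation described above.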


What can we do to mitigate this problem? Ideas, many of which will no doubt 
violate the promises that Hawk makes:

* Widen the validity window from 1 minute to 1 hour. Or six hours. Or three 
days.
* Do something non-conformant, like having clients pass their clock to the 
server, eliminating the requirement for clients to manage skew.
* Eliminate Hawk entirely, at least for the storage servers, switching the 
output of the token server to be some kind of short-lived bearer token.
* ???

More input, please!

-R



[0] https://bugzilla.mozilla.org/show_bug.cgi?id=957863
[1] https://bugzilla.mozilla.org/show_bug.cgi?id=962668, 
https://bugzilla.mozilla.org/show_bug.cgi?id=929066
[2] https://bugzilla.mozilla.org/show_bug.cgi?id=971059#c16
[3] https://bugzilla.mozilla.org/show_bug.cgi?id=971059
[4] 
http://lbadri.wordpress.com/2013/09/01/Hawk-authentication-for-asp-net-web-api-using-thinktecture-identitymodel-45-replay-protection/
[5] https://www.npmjs.org/package/Hawk
[6] http://opensignal.com/reports/timestamps/
_______________________________________________
Sync-dev mailing list
Sync-dev@mozilla.org
https://mail.mozilla.org/listinfo/sync-dev
