> So how can we make the gathering of these metrics feel as
> privacy-sensitive, as safe, as *right* as possible?

I'm not going to comment on what feels "creepy" or "right", only that I was
reminded of the following as I read through Ryan's concerns. A while back
Mayo pointed us at RAPPOR[1][2], a "novel privacy technology that allows
inferring statistics about populations while preserving the privacy of
individual users". The approach RAPPOR takes works better with data
submitted by the client (e.g. our telemetry system) than data derived from
server logs, as any particular application server will only be able to log
events for users that actually use the service.

-whd

[1] https://github.com/google/rappor
[2] http://googleresearch.blogspot.com/2014/10/learning-statistics-with-privacy-aided.html

On Mon, May 4, 2015 at 10:33 PM, Ryan Kelly <[email protected]> wrote:

> On 4/05/2015 15:23, Andrew Chilton wrote:
> > I'm currently looking at FxA user metrics this quarter and have a few
> > thoughts around some ideas. Let me kick off by just having a brief
> > outline so you get an idea of what's needed.
> >
> > "All services attached to FxA should report metrics on user
> > activity in a consistent, privacy-respecting manner. We should have
> > dashboards that allow us to measure success and monitor for problems."
>
> Thanks for kicking this off Andy, getting better metrics is a big part
> of our vision for a successful Q2. You and I have chatted a bit about
> this IRL, but I'll add some more notes and context below for the benefit
> of discussion on the list.
>
> > This may or may not be the exact goal we're reaching for, but it's
> > currently a good statement to aim for. Other questions we might like to
> > ask and answer in the future are:
> >
> > * how many users signed up this month?
> > * how many users used a particular service (e.g. Hello, Sync)?
> > * how many users logged in with a mobile device?
> > * (perhaps other questions we don't yet know about)
>
> One of the tricky-but-important questions we need to answer is:
>
> * how many users accessed more than one FxA service this month?
>
> It's worth calling this one out explicitly, because this is why we need
> to somehow correlate user activity across services.
>
> > To do this we need to figure out how to make this happen. By logging
> > 'user events' we should be able to take advantage of the regular data
> > pipeline running through Heka/ElasticSearch/Kibana and/or any other
> > Reporting/MapReduce plus custom dashboards as aimed for by the Data
> > Pipeline v2 [1]. Therefore this email is mainly concentrating on what we
> > do at the edges, i.e. our application servers.
> >
> > We would love to correlate users across services so we need something
> > which allows us to do this. Of course the uid of a user is the obvious
> > answer but one that would raise some privacy questions.
>
> Right, so the simplest thing would be for each service to just emit a
> bunch of JSON log entries like this:
>
>   log.info({
>     service: "hello",
>     uid: "ABCDEF123456",
>     event: "call",
>     timestamp: 1430802399476,
>   })
>
>   log.info({
>     service: "readinglist",
>     uid: "ABCDEF123456",
>     event: "save_item",
>     timestamp: 1430803546071,
>   })
>
> Heka could slurp this up and send them off for processing/aggregation in
> the same way that we currently do for the existing FxA
> monthly-active-users count.
>
> Would it be OK for us to just go ahead and ship it this way?
>
> To me it seems a little creepy. If these events are being stored
> somewhere, you could potentially build up a pretty nice picture of an
> individual user's activity by analyzing their stream of events.
> Accidentally leaking metrics in this form would be a pretty big deal.
>
> Can we do it in a more privacy-conscious manner?
>
> > Perhaps we could
> > post-process these in the data pipeline into something else, or we can
> > log something locally which we could use to correlate that same user to
> > another service (but not back to the user him/herself). The idea of a
> > Metrics ID has been raised which is a one-way mapping from uid to
> > Metrics ID (am leaving out any implementation details for now).
>
> We have a tiny bit of prior art here, in the monthly-active-users
> counting for sync:
>
>   https://bugzilla.mozilla.org/show_bug.cgi?id=1136014
>
> For this, we wound up emitting metrics events that look like:
>
>   log.info({
>     uid: HMAC_SHA256(<secret key>, <uid>),
>     timestamp: 1430802399476,
>     ...other sync-specific metrics...
>   })
>
> In other words, we use HMAC to derive an opaque "metrics id" from the
> account uid. This lets us count unique users of the service, but makes
> it harder to correlate the logs with a particular user record from FxA.
>
> If all the services used the same technique, we could do cross-service
> activity correlation.
>
> I'd be interested in people's thoughts on the usefulness of this
> obfuscation.
>
> > Of course, all services would need to know how to make that MetricsID if it
> > was logged at the edge, but if the uid was post-processed in the data
> > pipeline this could be done centrally.
>
> Yep. If every service is able to do the uid -> metrics-id mapping at
> will, then does it really gain us anything?
>
>
> I'd love for people to weigh in with their gut reactions here, even if
> you don't have any comments on the technical details.
>
> We will of course have to be in compliance with Mozilla's terms, privacy
> policy, etc when collecting all these metrics. But IMHO saying "we're
> compliant with the posted ToS!" is not much help if what we're doing
> just feels wrong to people.
>
> So how can we make the gathering of these metrics feel as
> privacy-sensitive, as safe, as *right* as possible?
>
>
>   Cheers,
>
>     Ryan
> _______________________________________________
> Dev-fxacct mailing list
> [email protected]
> https://mail.mozilla.org/listinfo/dev-fxacct
>
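
For concreteness, here is a rough sketch of the HMAC-derived "metrics id" idea from Ryan's mail, written for Node.js. The helper names (metricsId, metricsEvent) and the example key are made up for illustration; only the HMAC-SHA256 construction and the event fields come from the thread. It is not the actual sync implementation.

```javascript
const crypto = require("crypto");

// Hypothetical helper: derive an opaque metrics id from an account uid.
// Anyone holding the same secret key can recompute this mapping, which is
// exactly the concern raised in the thread.
function metricsId(secretKey, uid) {
  return crypto.createHmac("sha256", secretKey).update(uid).digest("hex");
}

// Hypothetical helper: build a log event that carries the metrics id
// in place of the raw uid.
function metricsEvent(secretKey, service, uid, event, timestamp = Date.now()) {
  return { service, uid: metricsId(secretKey, uid), event, timestamp };
}

// Assumption: every service shares this key (placeholder value only).
const key = "example-shared-secret";

const a = metricsEvent(key, "hello", "ABCDEF123456", "call", 1430802399476);
const b = metricsEvent(key, "readinglist", "ABCDEF123456", "save_item", 1430803546071);

// Same key and same uid give the same metrics id, so the two events can be
// correlated across services without the raw uid appearing in the logs.
console.log(a.uid === b.uid); // true
```

Note that this only obfuscates the uid. If the key were shared by all services, or leaked alongside the logs, anyone with a list of uids could rebuild the mapping.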
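
To give a feel for how client-side reporting can preserve individual privacy while still yielding population statistics, here is a toy sketch of classic randomized response, the basic idea that RAPPOR builds on. This is not RAPPOR itself (which adds Bloom filters and a two-stage randomization), and the function names and coin probabilities are assumptions chosen for illustration.

```javascript
// Each client randomizes its own yes/no answer before reporting it.
// `rng` is injectable so the behavior can be checked deterministically.
function randomizedReport(truth, rng = Math.random) {
  if (rng() < 0.5) return truth; // first coin: report the true answer
  return rng() < 0.5;            // otherwise: report a second fair coin
}

// The collector never learns any individual's true answer, but can still
// estimate the fraction of "yes" across the population:
//   E[observed] = 0.5 * trueRate + 0.25
function estimateTrueRate(reports) {
  const observed = reports.filter(Boolean).length / reports.length;
  return Math.min(1, Math.max(0, 2 * (observed - 0.25)));
}

// Simulated population: roughly 30% of users did the thing we are counting.
const reports = [];
for (let i = 0; i < 10000; i++) {
  reports.push(randomizedReport(Math.random() < 0.3));
}
console.log(estimateTrueRate(reports)); // close to 0.3, varies per run
```

Any single report is deniable, since it may just be a coin flip, which is why this style of collection fits client-submitted data better than server logs.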

