I'm not sure if this exactly addresses your needs, but I believe it is relevant:

https://en.wikipedia.org/wiki/Differential_privacy

HTH,

~T

----- Original Message -----
From: "Tom Lee" <[email protected]>
To: [email protected], [email protected]
Sent: Thursday, February 6, 2014 12:49:14 PM
Subject: [liberationtech] need advice on using hashes for preserving PII's 
utility for disambiguation while protecting sensitive info

We've been kicking around an idea at Sunlight that aims to use cryptographic 
ideas to resolve some of the concerns around the publication of personally 
identifiable information (PII) in government disclosures. I could use some 
smart people to tell me what's dumb about it. 

We often face challenges related to disambiguating entities: is the John Smith 
who gave political donation A the same John Smith that gave political donation 
B? One obvious solution to this problem is to push to expand the information 
that's collected and disclosed -- if we had John's driver's license number 
(DLN), for instance, it'd be easy to disambiguate these records. But that could 
introduce privacy concerns for John. One approach to this problem (which I 
don't think government has tried) is employing a one-way hash. 

Obviously the input key space for DLNs and most other personal ID numbers is so 
small that reversing this with a dictionary attack would be trivial. You can 
add a salt, but only on a per-entity basis (not a per-record basis) if you want 
to preserve the capacity to disambiguate. That in turn calls for a lookup 
table in which the input keys are stored, which kind of defeats the point of 
using a hash (you might as well just assign random output IDs for each input 
ID). I would worry about government's ability to keep this lookup table secure, 
and I worry about the brittleness of such a system. 
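
To make the dictionary-attack concern concrete, here is a minimal sketch in 
Python. It assumes a toy 5-digit ID space and SHA-256 as the hash; neither 
comes from any real DLN scheme, and the function names are hypothetical. 

```python
import hashlib

def unsalted_hash(dln: str) -> str:
    """Hash an ID with plain SHA-256: no salt, no secret."""
    return hashlib.sha256(dln.encode("utf-8")).hexdigest()

def dictionary_attack(target_hash: str, digits: int = 5):
    """Try every ID in the (assumed) all-numeric key space until one matches."""
    for n in range(10 ** digits):
        candidate = f"{n:0{digits}d}"  # zero-padded, e.g. "00042"
        if unsalted_hash(candidate) == target_hash:
            return candidate
    return None  # the input was outside the enumerated key space

# A published hash of a hypothetical ID is reversed by brute-force enumeration.
published = unsalted_hash("04217")
recovered = dictionary_attack(published)
```

Real ID formats are larger than this toy space, but if the full set of valid 
IDs can be enumerated, the same loop applies. 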

Alternately, you can use a single system-wide secret (or set of secrets) to 
transform inputs into reliable outputs. I think this is less brittle and maybe 
easier to preserve as a secret, but this system might be too easily reversible 
given the ability to observe its outputs and know the universe of possible 
inputs. I'm unsure of the cryptographic options that might be appropriate here. 
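
One standard construction along these lines is a keyed hash such as 
HMAC-SHA256, sketched below. This is only an illustration under assumptions: 
the key handling, the names SYSTEM_KEY and pseudonymize, and the sample IDs 
are all hypothetical, and a real deployment would need proper key storage. 

```python
import hashlib
import hmac
import secrets

# Assumption: one system-wide secret, generated once and kept out of any
# published data. Here it is created in memory purely for illustration.
SYSTEM_KEY = secrets.token_bytes(32)

def pseudonymize(dln: str, key: bytes = SYSTEM_KEY) -> str:
    """Map an ID to a stable pseudonym using HMAC-SHA256 under a secret key."""
    return hmac.new(key, dln.encode("utf-8"), hashlib.sha256).hexdigest()

# The same input always gives the same output, so records can be linked.
same_person = pseudonymize("D1234567") == pseudonymize("D1234567")
# Different inputs give different outputs.
different_people = pseudonymize("D1234567") != pseudonymize("D1234568")
```

With a construction like this, knowing the universe of possible inputs does 
not by itself allow a dictionary attack, because candidate outputs cannot be 
computed without the key. The whole scheme then rests on keeping that key 
secret. 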

For all I know, the lack of implementations using this kind of one-way 
transformation isn't about government sluggishness but rather about its 
feasibility. I'd be very curious to hear folks' ideas on this score, though. My 
general hunch is that something must be possible -- even a few bits' worth of 
disambiguating information would be hugely useful to us, and presumably you're 
not leaking important amounts of information by, say, sharing the last digit of 
a DLN. So there must be a spectrum of options. But as is probably apparent, I 
don't think I've got a handle on how to think about this problem rigorously. 
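
As a rough sketch of the "few bits" end of that spectrum, the example below 
publishes only a short tag derived from a keyed hash. The 4-bit width, the 
key, and the name short_tag are assumptions made for illustration, not a 
recommendation. 

```python
import hashlib
import hmac

def short_tag(dln: str, key: bytes, bits: int = 4) -> int:
    """Return the top `bits` bits (1 to 32) of HMAC-SHA256(key, dln)."""
    digest = hmac.new(key, dln.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") >> (32 - bits)

demo_key = b"hypothetical-demo-key"  # placeholder, not a real secret

# With 4 bits there are only 16 possible tags, so many IDs share each one.
tags = [short_tag(f"{n:05d}", demo_key) for n in range(1000)]
distinct_tags = len(set(tags))
```

Two records with different tags must come from different IDs, while two 
records with the same tag are only weakly linked, since with 4 bits roughly 
one ID in sixteen shares any given tag. 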

Tom 

-- 
Liberationtech is public & archives are searchable on Google. Violations of 
list guidelines will get you moderated: 
https://mailman.stanford.edu/mailman/listinfo/liberationtech. Unsubscribe, 
change to digest, or change password by emailing moderator at 
[email protected].