Re: Request for feedback on crypto privacy protections of geolocation data

Hanno Schlichting Mon, 09 Sep 2013 19:17:12 -0700

On 09.09.2013, at 18:13 , Brian Smith <br...@briansmith.org> wrote:
> On Mon, Sep 9, 2013 at 2:58 PM, Chris Peterson <cpeter...@mozilla.com> wrote:
>> Google's Location Service prevents people from tracking individual access
>> points by requiring requests to include at least 2-3 access points that
>> Google knows are near each other. This "proves" the requester is near the
>> access points.
> 
> I assume that "prevents people from tracking individual access points"
> means the following: Some people have a personal access point on them
> (e.g. in their phone). If somebody knows the SSID and MAC of this
> personal access point, then they could track this person's location by
> polling the database for that (SSID, MAC) pair. Google tries to limit
> this type of abuse as much as practical while still
> providing a location service based on such crowdsourced data.


Yes :) Though there's one crucial difference between Google and us: We would 
like to make as much of this data public as possible, while Google will always 
just provide a service without access to the underlying data.

>> Unlike Google's Location Service, our server does not store MAC addresses or
>> SSIDs. We identify access points by hash IDs, specifically SHA1(MAC+SSID).
>> To query the location of an access point in the database, you must know both
>> its MAC address and current SSID.
> 
> MAC addresses are 48 bits. SSIDs are often guessable or predictable.
> Therefore, using the H(MAC+SSID) instead of just the plain MAC+SSID is
> not buying you much in terms of privacy, IMO. Basically, if you are
> really trying to use this as a privacy mechanism then you should store
> the MAC+SSID according to best practices for storing passwords. For
> example, use PBKDF2 with a large number of iterations. Regardless of
> whether you use SHA1, SHA2, PBKDF2, or something else, I will still
> call whatever function you use H(x). But, I am not sure that switching
> to PBKDF2 even buys you much improved privacy protection.

We were hoping to achieve two things by using SHA-1:

- Make it possible for the end user to change their unique value (they cannot
change the MAC address, but they can change the SSID). This allows them to
"invalidate" historical records in the database.
- Make it harder for spammers to "guess" actual unique keys and flood our
service. MAC addresses have a vendor prefix, which makes it rather easy to
generate lots of valid MAC addresses. Taking the SSID into account makes it
harder to generate valid keys. Unfortunately, the SSID itself is considered
private data in European countries, so you aren't allowed to store it without
the user's consent. That's why Google and everyone else have stopped storing
SSIDs and only use MAC addresses now.

The SHA-1 scheme might be ineffective at achieving either of these goals.
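
To make the key scheme concrete, here is a minimal Python sketch of
SHA1(MAC+SSID) next to a PBKDF2 variant along the lines Brian suggests. The
function names, the MAC normalization, the salt and the iteration count are
all assumptions for illustration; none of them describe what our server
actually does.

```python
import hashlib


def normalize_mac(mac: str) -> str:
    # Assumption: MACs are compared as lowercase hex without separators.
    return mac.replace(":", "").replace("-", "").lower()


def ap_key(mac: str, ssid: str) -> str:
    """Plain SHA-1 over MAC + SSID, i.e. the H(MAC+SSID) discussed above."""
    data = (normalize_mac(mac) + ssid).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def ap_key_slow(mac: str, ssid: str,
                salt: bytes = b"hypothetical-static-salt",
                iterations: int = 100_000) -> str:
    """Password-style variant: PBKDF2-HMAC-SHA256 over the same input.

    The salt and iteration count are placeholders, not recommendations.
    """
    data = (normalize_mac(mac) + ssid).encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", data, salt, iterations).hex()


# Changing the SSID yields an unrelated key, so records stored under the
# old key can no longer be reached with the new (MAC, SSID) pair.
old_key = ap_key("00:11:22:33:44:55", "HomeNet")
new_key = ap_key("00:11:22:33:44:55", "HomeNet-2")
print(old_key != new_key)
```

Either function is still deterministic, so anyone who can guess the SSID and
enumerate MAC addresses can recompute the key; the slow variant only raises
the cost per guess.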

>>    H1 = Hash(AP1.MAC + AP1.SSID)
>>    H2 = Hash(AP2.MAC + AP2.SSID)
>> 
>> Our private database's schema looks something like:
>> 
>>    Hash(AP1.MAC + AP1.SSID) ==> AP1.latitude, AP1.longitude, ...
>>    Hash(AP2.MAC + AP2.SSID) ==> AP2.latitude, AP2.longitude, ...
>> 
>> Our published database would include two tables. The first table would map a
>> random row id to metadata about an anonymous access point:
>> 
>>    Random1 ==> AP1.latitude, AP1.longitude, ...
>>    Random2 ==> AP2.latitude, AP2.longitude, ...
>> 
>> The second table's primary key would be a hash of hashes. It would map a
>> hash of two neighboring access points' hash IDs to a row id of the first
>> table. Something like:
>> 
>>    Hash(H1 + H2) ==> Random1
>>    Hash(H2 + H1) ==> Random2
>> 
>> Someone querying the published database would need to know the MAC addresses
>> and current SSIDs of two neighboring access points to look up either's
>> location.
> 
> If you know the MAC+SSID of person X's personal access point and the
> MAC+SSID of person Y's personal access point, then you can use this
> database to ask the question "are person X and person Y in the same
> location?" This seems bad. I see that you attempt to address this
> below.

On the service level, we can prevent this by adding extra thresholds, such as
filtering out "moving" APs and only reporting APs which have been seen in the
same location a number of times over a minimum time period.
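
A rough Python sketch of what such thresholds could look like. The
Observation record, the stable_keys name and all three threshold values are
hypothetical and only illustrate the idea:

```python
from dataclasses import dataclass


@dataclass
class Observation:
    key: str          # identifier of the access point (hashed or not)
    lat: float
    lon: float
    timestamp: float  # seconds since the epoch


def stable_keys(observations, min_count=3, min_span_s=7 * 24 * 3600,
                max_drift_deg=0.005):
    """Return keys of APs that look stationary and were seen often enough.

    All thresholds are illustrative assumptions, not tuned values.
    """
    by_key = {}
    for obs in observations:
        by_key.setdefault(obs.key, []).append(obs)

    stable = set()
    for key, group in by_key.items():
        lats = [o.lat for o in group]
        lons = [o.lon for o in group]
        times = [o.timestamp for o in group]
        # Crude bounding-box check for APs that "move".
        moved = (max(lats) - min(lats) > max_drift_deg
                 or max(lons) - min(lons) > max_drift_deg)
        seen_enough = len(group) >= min_count
        seen_long_enough = max(times) - min(times) >= min_span_s
        if seen_enough and seen_long_enough and not moved:
            stable.add(key)
    return stable
```

Only APs whose key ends up in the returned set would be reported by the
service; everything else would be treated as unknown.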

But this doesn't help us when publishing the underlying data.

>> btw, should we use SHA-2 instead of SHA-1?
> 
> There is no reason to use SHA-1 when you have SHA-2 available.
> However, as I indicated above, it isn't clear it is a good idea to be
> using any plain hash function as H(x).
> 
>> Other layers of privacy protection include filtering out ad-hoc Wi-Fi
>> networks; MAC addresses with vendor prefixes from mobile device manufacturers
>> (e.g. Apple and HTC); SSIDs commonly associated with mobile devices (e.g.
>> "XXX's iPhone" and Google's "_nomap" opt-out); and APs reported in multiple
>> locations.
> 
> I think that these things are much more important than the protection
> offered by H(x). My concern is that if you store the data on the
> server as H(x) then you will not be able to do the above filtering on
> the server unless H(x) is ineffective. That seems bad, because the
> server will be much easier to update to improve the filtering than the
> clients will be, AFAICT. Also, you will not be able to measure the
> effectiveness of the privacy protections on the server, which is also
> very bad.
> 
> Therefore, I'd suggest that you avoid using any protection at all, and
> just use x instead of H(x) until we are very confident there is no way
> we can further improve the filtering.

This sounds like good advice and I'm starting to lean in this direction.

But this only helps us on the "we provide a service" side. It's still unclear 
to me if and how we could share any of this data as database dumps.
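
For reference, the published two-table layout quoted above can be sketched
in Python as follows. The function names and in-memory dicts are
hypothetical, and SHA-256 stands in for the unspecified Hash(); this only
illustrates the proposal and is not a finished design.

```python
import hashlib
import secrets


def h(data: str) -> str:
    # Assumption: SHA-256 as the hash; the proposal leaves Hash() open.
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def build_published(private, neighbors):
    """Build the two published tables.

    private:   {ap_hash: (lat, lon)}, the private schema keyed by H(MAC+SSID)
    neighbors: iterable of (h1, h2) pairs of APs seen near each other
    """
    # Table 1: random row id ==> location of an anonymous access point.
    row_ids = {ap: secrets.token_hex(8) for ap in private}
    locations = {row_ids[ap]: loc for ap, loc in private.items()}

    # Table 2: Hash(H1 + H2) ==> row id of the first AP of the pair.
    pair_index = {}
    for h1, h2 in neighbors:
        pair_index[h(h1 + h2)] = row_ids[h1]
        pair_index[h(h2 + h1)] = row_ids[h2]
    return locations, pair_index


def lookup(locations, pair_index, h1, h2):
    """Location of the AP behind h1, if h1 and h2 are known neighbors."""
    row = pair_index.get(h(h1 + h2))
    return locations.get(row) if row is not None else None
```

Note that a non-None result from lookup() for two personal access points is
exactly the "are person X and person Y in the same location?" answer Brian
is worried about, which is why the filtering matters before publishing.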

Hanno
_______________________________________________
dev-security mailing list
dev-security@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security
