On 9/9/13 6:13 PM, Brian Smith wrote:
On Mon, Sep 9, 2013 at 2:58 PM, Chris Peterson cpeter...@mozilla.com wrote:
Google's Location Service prevents people from tracking individual access
points by requiring requests to include at least 2-3 access points that
Google knows are near each other. This proves the requester is near the
access points.
I assume that "prevents people from tracking individual access points"
means the following: Some people have a personal access point on them
(e.g. in their phone). If somebody knows the SSID and MAC of this
personal access point, then they could track this person's location by
polling the database for that (SSID, MAC) pair. Google tries to limit
this type of abuse as much as practical while still
providing a location service based on such crowdsourced data.
Unlike Google's Location Service, our server does not store MAC addresses or
SSIDs. We identify access points by hash IDs, specifically SHA1(MAC+SSID).
To query the location of an access point in the database, you must know both
its MAC address and current SSID.
MAC addresses are 48 bits. SSIDs are often guessable or predictable.
Therefore, using H(MAC+SSID) instead of just the plain MAC+SSID is
not buying you much in terms of privacy, IMO. Basically, if you are
really trying to use this as a privacy mechanism then you should store
the MAC+SSID according to best practices for storing passwords. For
example, use PBKDF2 with a large number of iterations. Regardless of
whether you use SHA1, SHA2, PBKDF2, or something else, I will still
call whatever function you use H(x). But, I am not sure that switching
to PBKDF2 even buys you much improved privacy protection.
Switching to PBKDF2 can buy you a lot of protection from brute forcing
the database (especially if it is published as specified). So I would say
use PBKDF2 for H and not worry about concatenation vs xoring.
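A minimal sketch of what using PBKDF2 for H could look like. The salt value, the iteration count, and the MAC normalization are assumptions made for illustration, not details from this thread. The salt has to be fixed and known to clients, since a lookup must reproduce the same hash; the protection comes only from the per-guess cost.

```python
import hashlib

# Assumed parameters (not from the thread): a fixed, public salt and an
# iteration count chosen purely for illustration.
SALT = b"example-location-service"
ITERATIONS = 100_000


def ap_hash(mac: str, ssid: str) -> str:
    """Return H(MAC+SSID) as hex, using PBKDF2-HMAC-SHA256 for H."""
    # Normalize the MAC so "AA:BB:..." and "aa-bb-..." hash identically.
    # After normalization the MAC is always 12 hex characters, so plain
    # concatenation with the SSID is unambiguous.
    mac_norm = mac.lower().replace(":", "").replace("-", "")
    material = mac_norm.encode("ascii") + ssid.encode("utf-8")
    return hashlib.pbkdf2_hmac("sha256", material, SALT, ITERATIONS).hex()
```

With these assumed settings each guess costs roughly 100,000 HMAC evaluations instead of one hash, which scales any brute-force estimate by about the same factor.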
H1 = Hash(AP1.MAC + AP1.SSID)
H2 = Hash(AP2.MAC + AP2.SSID)
Our private database's schema looks something like:
Hash(AP1.MAC + AP1.SSID) == AP1.latitude, AP1.longitude, ...
Hash(AP2.MAC + AP2.SSID) == AP2.latitude, AP2.longitude, ...
This is a pseudonymous data set, which can be problematic (I would
reduce the resolution of each entry so that we can have some
k-anonymity here). You could even cluster the locations.
Our published database would include two tables. The first table would map a
random row id to metadata about an anonymous access point:
Random1 == AP1.latitude, AP1.longitude, ...
Random2 == AP2.latitude, AP2.longitude, ...
The second table's primary key would be a hash of hashes. It would map a
hash of two neighboring access points' hash IDs to a row id of the first
table. Something like:
Hash(H1 + H2) == Random1
Hash(H2 + H1) == Random2
Someone querying the published database would need to know the MAC addresses
and current SSIDs of two neighboring access points to look up either's
location.
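The proposed two-table layout can be sketched roughly as follows. SHA-256 stands in for Hash() only to keep the example short, and the function names and the in-memory dict representation are hypothetical, not part of the proposal.

```python
import hashlib
import secrets


def h(data: str) -> str:
    # Stand-in for Hash(); SHA-256 is an assumption made for brevity.
    return hashlib.sha256(data.encode("utf-8")).hexdigest()


def build_published_tables(aps, neighbor_pairs):
    """aps: {name: (mac, ssid, lat, lon)}; neighbor_pairs: [(name_a, name_b)]."""
    ap_hashes = {n: h(mac + ssid) for n, (mac, ssid, _, _) in aps.items()}
    # Table 1: random row id -> location of an anonymous access point.
    row_ids = {n: secrets.token_hex(8) for n in aps}
    table1 = {row_ids[n]: (lat, lon) for n, (_, _, lat, lon) in aps.items()}
    # Table 2: Hash(H1 + H2) -> row id of the first access point of the pair.
    table2 = {}
    for a, b in neighbor_pairs:
        table2[h(ap_hashes[a] + ap_hashes[b])] = row_ids[a]
        table2[h(ap_hashes[b] + ap_hashes[a])] = row_ids[b]
    return table1, table2


def lookup(table1, table2, first, second):
    """first, second: (mac, ssid) tuples. Returns first's location or None."""
    key = h(h(first[0] + first[1]) + h(second[0] + second[1]))
    row = table2.get(key)
    return table1.get(row) if row is not None else None
```

A caller who knows only one access point, or two that are not recorded as neighbors, gets no match from `lookup`.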
If this is published as specified there are a couple of attacks I can
think of now:
1. If you know that, let's say, org A has SSID Y and uses vendor Z (~18 bits of
entropy per AP), you can search the
table to determine all of the locations of that org (~2^36
hashes). Given current speeds of ASIC hashing (~US$1.5K for 63e9
H/s ~= 2^36 H/s), you could do this in about 1 sec (penalty for
using video cards instead of ASICs: 100x, so about two mins). This assumes you
are using plain SHA-1/SHA-256.
2. If you have a set of common AP SSIDs (say, fonera) and potential
vendors for that system, you can test the closeness of any known
location in your exposed list against ~2^32 potential MACs in less than one
sec per known location. If you don't know the vendor, I think the number of
tests would not be greater than 2^38 if you can discard part of the MAC address
space. This again can be checked in a few secs.
3. From table 2 you can cluster the locations of closely located APs, and
given table 1 you can then learn the exact AP locations from the
clusters. You can then focus on the potential locations of interest.
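As a quick check of the arithmetic in attacks 1 and 2, the search times follow directly from the figures quoted above (63e9 H/s for the ASIC, a 100x penalty for video cards). The function name is hypothetical, and the estimate assumes one plain hash per guess.

```python
ASIC_RATE = 63e9    # hashes per second, figure quoted in the thread
GPU_PENALTY = 100   # slowdown factor for video cards, from the thread


def seconds_to_search(bits: int, rate: float = ASIC_RATE) -> float:
    """Time to try every candidate in a 2**bits search space."""
    return 2 ** bits / rate


cases = [
    ("attack 1: two APs at ~18 bits each", 36),
    ("attack 2: vendor known", 32),
    ("attack 2: vendor unknown", 38),
]
for label, bits in cases:
    asic = seconds_to_search(bits)
    print(f"{label}: {asic:.2f} s on ASIC, {asic * GPU_PENALTY:.0f} s on GPUs")
```

A slower H such as PBKDF2 multiplies each of these times by roughly its iteration count.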
So I think publishing table 2 as suggested is a bad idea.
I would start with the service first (with 3 AP locations required for
high res data) and not the public location store. I would be OK with
only 1 AP location for data retrieval if we significantly reduce the
resolution of the reply to not less than one degree (at worst that is a
delta of ~20 miles) and there is more than one AP in that area.
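One way the reduced-resolution reply could work, as a rough sketch: snap each location to a one-degree grid cell and answer only when more than one AP falls in that cell. The function names, the grid-cell approach, and the dict-based store are assumptions for illustration.

```python
import math
from collections import Counter


def cell(lat: float, lon: float, degrees: float = 1.0):
    """Snap a coordinate to the south-west corner of its grid cell."""
    return (math.floor(lat / degrees) * degrees,
            math.floor(lon / degrees) * degrees)


def coarse_reply(ap_locations, queried, degrees=1.0, min_aps=2):
    """ap_locations: {ap_id: (lat, lon)}. Returns a coarse cell or None."""
    if queried not in ap_locations:
        return None
    # Count how many known APs share each grid cell.
    counts = Counter(cell(lat, lon, degrees)
                     for lat, lon in ap_locations.values())
    c = cell(*ap_locations[queried], degrees)
    # Refuse to answer when the AP is alone in its cell.
    return c if counts[c] >= min_aps else None
```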
Camilo
If you know the MAC+SSID of person X's personal access point and the
MAC+SSID of person Y's personal access point, then you can use this
database to ask the question "are person X and person Y in the same
location?" This seems bad. I see that you attempt to address this
below.
btw, should we use SHA-2 instead of SHA-1?
There is no reason to use SHA-1 when you have SHA-2 available.
However, as I indicated above, it isn't clear it is a good idea to be
using