Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-11 Thread Gervase Markham
On 10/09/13 19:05, Chris Peterson wrote:
 Our location service (and stumbler) also collects cell data, so we can
 geolocate with Wi-Fi AP and/or cell data.

Sure. But in the rural areas I am thinking about, cells cover many
square km. The wifi access point has a much smaller range, and therefore
geolocates a person much more precisely.

So it would be awesome if I could say "I'm in this network cell, near
this single access point - tell me where I am, please", and the service
complied.

Of course, your hash-combining idea has the problem of combinatorial
explosion if we do multiple ways of specifying 2 data points.

Gerv

___
dev-security mailing list
dev-security@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-security


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-11 Thread Chris Peterson

On 9/11/13 9:59 AM, Hanno Schlichting wrote:

But at this point it seems clear to me, that there's likely no way to share any 
meaningful subset or aggregated version of this data publicly at all.


No way to share the Wi-Fi data. Our stumblers are also collecting cell 
tower data and I don't see any privacy reasons we can't share the 
aggregated cell data.



chris


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-11 Thread Hanno Schlichting
On 10.09.2013, at 20:23 , ianG i...@iang.org wrote:
 On 11/09/13 03:27 AM, Daniel Veditz wrote:
 "private" means we can't even /look/ at it, rather than merely can't
 store it?
 
 The data regime might be simply put as this:  you can't store a number 
 suitable for tracking (or any derivative of it if that simply creates a new 
 tracking number) unless you have a compelling business reason, and you have 
 agreement.
 
 The EU data protection regime makes a very strong distinction about any 
 private tracking information.  It also goes to another level if you share 
 that information with anyone.
 
 The initial simple answer is, don't go there.  (I have no idea how google 
 finessed this issue, or even if they didn't.)

Most of this is very much a gray area. The data privacy officers / protection 
agencies have generally recognized that location services based on wifi 
networks are a very useful service, and in order to practically run them, you 
have to be able to collect wifi bssid's without getting the individual assent 
of every wifi AP operator.

But at the same time they consider the combination of a bssid, timestamp and 
geolocation as personally identifiable information suitable for tracking. Much 
like IP addresses, or phone numbers.

So currently there's an unspoken agreement where industry players like Google, 
Microsoft and Apple have voluntarily put some restrictions into place. One of 
those is the introduction of the _nomap network name suffix, which was deemed 
an effective way for wifi operators to opt-out of the data gathering (see for 
example 
http://www.dutchdpa.nl/Pages/en_pb_20120405_google-complies-with-Dutch-DPA-requirements.aspx).

Another case was the introduction of the "you need to know two nearby wifis 
to geolocate yourself" protection. This was a measure suggested and implemented 
first by Google in response to media outcries and has now become an industry 
best practice. But it's not actually mandated by any official regulation to my 
knowledge.

For now the whole space hasn't seen official tight regulation and the industry 
players are allowed to continue to operate. But it's a fine balance and any new 
media outcries or questionable behavior can threaten this balance.

So for us this means trying to adhere to existing industry best practices and 
generally following data privacy best practices like: only gather and store 
what you need, delete data as soon as you don't need it anymore, etc.

All of this applies to the hosted service use-case, where we keep the data 
internal and don't share or sell it for other purposes. Since it's all 
unofficial agreements, it's hard or even impossible to know exactly what we 
should do for the "we want to publicly share this data" use-case.

Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-11 Thread Hanno Schlichting
On 10.09.2013, at 17:41 , Daniel Veditz dved...@mozilla.com wrote:
 That can't be right, so your database must be more complex. If you're
 storing more than originally implied that may have some impact on a
 security assessment.

We apparently haven't been clear about the scope of the proposal. It only deals 
with a way to export and publicly share a subset of our data. Internally the 
service has a lot more data, but there's no way we can share that, thanks to 
the privacy aspects of it.

But at this point it seems clear to me, that there's likely no way to share any 
meaningful subset or aggregated version of this data publicly at all.

Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-11 Thread Hanno Schlichting
On 11.09.2013, at 02:06 , Gervase Markham g...@mozilla.org wrote:
 On 10/09/13 19:05, Chris Peterson wrote:
 Our location service (and stumbler) also collects cell data, so we can
 geolocate with Wi-Fi AP and/or cell data.
 
 Sure. But in the rural areas I am thinking about, cells cover many
 square km. The wifi access point has a much smaller range, and therefore
 geolocates a person much more precisely.
 
 So it would be awesome if I could say I'm in this network cell, near
 this single access point - tell me where I am, please, and the service
 complied.

That's a good idea, I added a ticket about it at 
https://github.com/mozilla/ichnaea/issues/23

Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Chris Peterson


On 9/9/13 6:13 PM, Brian Smith wrote:

I assume "prevents people from tracking individual access points"
means the following: Some people have a personal access point on them
(e.g. in their phone). If somebody knows the SSID and MAC of this
personal access point, then they could track this person's location by
polling the database for that (SSID, MAC) pair.


Tracking a person's movements by polling the database would not be 
useful because we would probably update the database infrequently (days 
or weeks). The location database would be generated offline from 
analysis of many raw measurements submitted by the stumbler app.


The tracking scenario that might be viable is a tracker who knows 
someone's MAC address and current SSID, and that person moves to a 
different city or state. The database delay wouldn't matter as much. The 
"hash of hashes" scheme tries to protect against that by requiring two 
neighboring APs.




MAC addresses are 48 bits. SSIDs are often guessable or predictable.
Therefore, using the H(MAC+SSID) instead of just the plain MAC+SSID is
not buying you much in terms of privacy, IMO. Basically, if you are
really trying to use this as a privacy mechanism then you should store
the MAC+SSID according to best practices for storing passwords. For
example, use PBKDF2 with a large number of iterations. Regardless of
whether you use SHA1, SHA2, PBKDF2, or something else, I will still
call whatever function you use H(x). But, I am not sure that switching
to PBKDF2 even buys you much improved privacy protection.


The primary motivation for hashing the MAC+SSID was to avoid uploading 
the SSID (which is considered private data in some European countries) 
while still using the SSID as a sort of weak protection against database 
pollution from malicious stumblers reporting spoofed MAC addresses. 
Even if our database were filled with junk MAC addresses, real clients 
would probably not see the same combination of MAC and SSID in the real 
world when they sent a geolocation request to the server.




Other layers of privacy protection include filtering out ad-hoc Wi-Fi
networks; MAC addresses with vendor prefixes from mobile device manufacturers
(e.g. Apple and HTC); SSIDs commonly associated with mobile devices (e.g.
XXX's iPhone and Google's _nomap opt-out); and APs reported in multiple
locations.


I think that these things are much more important than the protection
offered by H(x). My concern is that if you store the data on the
server as H(x) then you will not be able to do the above filtering on
the server unless H(x) is ineffective. That seems bad, because the
server will be much easier to update to improve the filtering than the
clients will be, AFAICT. Also, you will not be able to measure the
effectiveness of the privacy protections on the server, which is also
very bad.


Very good points. We are currently filtering on the stumbler client 
side. Today, the server just receives mystery hashes with latitude and 
longitude.


Given just MAC addresses, the server could still filter out ad-hoc 
networks; vendor prefixes for known mobile device manufacturers; and 
unrecognized vendor prefixes (because some mobile devices supposedly 
generate completely random MAC addresses).


We would still need to rely on the stumbler to filter SSIDs. We can't 
upload SSIDs to the server because they are considered private data in 
some European countries (though MAC addresses, which are more unique, 
are apparently not considered private data, in a legal sense).


We've compiled a list of about 70 SSID prefixes and suffixes we've seen 
from mobile devices (e.g. Android*, Verizon *, or *'s iPhone). Not 
all of these mobile devices use ad-hoc MAC addresses.
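
A minimal sketch of the stumbler-side SSID filter described above, assuming the patterns are simple shell-style globs. The three patterns are the examples given in this message; the real list of about 70 entries is not shown here, and the function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Example patterns taken from the message; the full list is not reproduced.
MOBILE_SSID_PATTERNS = ["Android*", "Verizon *", "*'s iPhone"]
OPT_OUT_SUFFIX = "_nomap"  # Google's SSID opt-out suffix


def should_skip_ssid(ssid: str) -> bool:
    """Return True if the stumbler should drop this AP before hashing/upload."""
    if ssid.endswith(OPT_OUT_SUFFIX):
        return True
    # Case-sensitive glob match against each known mobile-device pattern.
    return any(fnmatchcase(ssid, pattern) for pattern in MOBILE_SSID_PATTERNS)
```

Because the SSID never leaves the device in this design, a check like this has to run in the stumbler rather than on the server.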


Trivia: over a couple years of my own Wi-Fi stumbling/wardriving in 
three countries and six US states, I have recorded over 100K unique APs 
and only eight used Google's _nomap SSID opt-out suffix!



chris


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread ianG

On 10/09/13 00:58 AM, Chris Peterson wrote:

I'm looking for some feedback on crypto privacy protections for a
geolocation research project I'm working on with the Mozilla Services
team. If you have general questions or suggestions about the project,
I'm happy to answer them, but I'd like to focus this thread on crypto.

Our team is prototyping a crowd-sourced version of Google's Street View
cars to correlate Wi-Fi access points and cell towers to GPS positions.
Our primary motivation is to provide non-proprietary location services
for Firefox OS devices.



If I read this correctly, you want your client devices to figure out 
where they are, right?


If that is the case, why not flip it around.  Instead of trying to 
interpolate the existing data that is broadcast out there, why not write 
a protocol to broadcast the direct location from the wireless access point?


A lot of these routers run Linux, and this is a place where people would 
be interested in running a new service.


A wireless router that broadcasts its geolocation is not a privacy 
issue.  There is no reason why it can't be turned on by default.


But anything else requires a horrible mishmash of approaches.  To obtain 
what?  Something the wireless can easily tell you directly.




iang


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Gervase Markham
On 10/09/13 06:05, Chris Peterson wrote:
 The device would scan for nearby APs and send the hash of each AP's MAC
 and SSID to our location server. Our server would not need to worry
 about the hash of hashes pairs because that would only be used for
 published data. The server would return an estimated latitude,
 longitude, and accuracy (radius in meters) of the device among the APs.

BTW, how does the service figure out the lat/long of an AP? Do we do
anything at all with signal strengths? Could we?

Gerv



Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Gervase Markham
On 10/09/13 08:04, Henri Sivonen wrote:
  1) Android has a mechanism for detecting when it is connecting to a
 portable AP provided by another Android device. Can we use the same or
 a similar detection mechanism to detect portable APs and filter them
 out?

I suspect actually connecting to the APs, as opposed to passively
sniffing, might be on the project's big list of NoNos... But if we
could, I agree we could find more useful data.

 location.) Are there any plans for a crowdsourced mechanism  for
 blacklisting such APs?

Not sure about crowdsourcing, but I believe they plan to use over-time
algorithms for blocking regularly-moving APs.

Gerv



Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Gervase Markham
On 10/09/13 00:25, R. Jason Cronk wrote:
 Is the data aged?

Not AFAIAA.

 What happens if I move? 

The raw database notes that you are now being detected in a new
location. What happens then is up for debate. I'd argue that if your
position was fixed for N months before, and it seems fixed again now, we
should assume you have moved house and keep the point in the DB. APs
which seem to move a lot, or move regularly, should be excluded.

 Does this give Mozilla the
 ability to historically track me if I move my device? 

Yes; this is why publishing the full raw stumbled data sets is sadly
going to be not possible.

 Our published database would include two tables. The first table would
 map a random row id to metadata about an anonymous access point:

 Random1 == AP1.latitude, AP1.longitude, ...
 Random2 == AP2.latitude, AP2.longitude, ...
 
 I would be hesitant to use the word anonymous here. Latlong is easily
 combined with other publicly available databases that could identify
 individual addresses and thus individuals. Again, it comes down to
 granularity of the data.

I'm not sure what threat you are seeing. Can you elaborate? This is just
a list of latlongs which have a wireless access point. How can this
information assist in identifying individuals or their locations?

Gerv



Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Gervase Markham
On 09/09/13 22:58, Chris Peterson wrote:
 Google's Location Service prevents people from tracking individual
 access points by requiring requests to include at least 2-3 access
 points that Google knows are near each other. This proves the
 requester is near the access points.

Related question: it would be great if there were some way to lift this
restriction, at least for the web service if not for the database, while
preserving the necessary privacy protections. My family's house, which
is in a rural area, has a single access point; I want my phone to know
where it is immediately when I'm there. Not everywhere has lots of
access points.

One thought I had was to allow submission of the MCC/MNC (mobile network
IDs) as proof that you were nearby.

 Unlike Google's Location Service, our server does not store MAC
 addresses or SSIDs. We identify access points by hash IDs, specifically
 SHA1(MAC+SSID). To query the location of an access point in the
 database, you must know both its MAC address and current SSID.

I think that this is an excellent idea, for the reasons you articulate
later in the thread.

Gerv



Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Gervase Markham
On 10/09/13 10:48, ianG wrote:
 If that is the case, why not flip it around.  Instead of trying to
 interpolate the existing data that is broadcast out there, why not write
 a protocol to broadcast the direct location from the wireless access point?

Because only a tiny, tiny fraction of devices would run it, and for most
of those, the user wouldn't have correctly set the device's location
anyway, and for some of them, they'd have set it and then moved.

This is a "boil the sea" approach to the problem.

Gerv



Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Chris Peterson

On 9/10/13 3:46 AM, Gervase Markham wrote:

I believe the plan is to have a database of raw findings, then a
processed database used by the web service, and a published database
which may have even more data reduction.

Chris P: can we get permission to store the raw SSID in the
_unpublished_ database?


SSIDs are considered personal data in some European countries, so we 
can't collect them without AP owner opt-in. Opt-in is infeasible, so we 
can't even collect raw SSIDs.



chris



Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Camilo Viecco

On 9/9/13 6:13 PM, Brian Smith wrote:

On Mon, Sep 9, 2013 at 2:58 PM, Chris Peterson cpeter...@mozilla.com wrote:

Google's Location Service prevents people from tracking individual access
points by requiring requests to include at least 2-3 access points that
Google knows are near each other. This proves the requester is near the
access points.

I assume "prevents people from tracking individual access points"
means the following: Some people have a personal access point on them
(e.g. in their phone). If somebody knows the SSID and MAC of this
personal access point, then they could track this person's location by
polling the database for that (SSID, MAC) pair. Google tries to limit
this type of abuse as much as practical while still providing a
location service based on such crowdsourced data.


Unlike Google's Location Service, our server does not store MAC addresses or
SSIDs. We identify access points by hash IDs, specifically SHA1(MAC+SSID).
To query the location of an access point in the database, you must know both
its MAC address and current SSID.

MAC addresses are 48 bits. SSIDs are often guessable or predictable.
Therefore, using the H(MAC+SSID) instead of just the plain MAC+SSID is
not buying you much in terms of privacy, IMO. Basically, if you are
really trying to use this as a privacy mechanism then you should store
the MAC+SSID according to best practices for storing passwords. For
example, use PBKDF2 with a large number of iterations. Regardless of
whether you use SHA1, SHA2, PBKDF2, or something else, I will still
call whatever function you use H(x). But, I am not sure that switching
to PBKDF2 even buys you much improved privacy protection.
Switching to PBKDF2 can buy you a lot of protection from brute forcing 
the database (especially if it is published as specified). So I would say 
use PBKDF2 for H and not worry about concatenation vs XORing.
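
As a rough illustration of using PBKDF2 for H, here is a minimal sketch with Python's standard library. The iteration count, the salt value and the function name are assumptions for illustration, not anything the project has settled on.

```python
import hashlib

# Hypothetical parameters, chosen only for illustration.
ITERATIONS = 100_000
# The salt has to be a fixed, publicly known value: clients must be able to
# recompute the same H(MAC+SSID) to do a lookup, so a per-row secret salt
# (as used for password storage) is not an option here.
SALT = b"example-public-salt"


def slow_ap_hash(mac: bytes, ssid: str) -> str:
    """H(MAC + SSID) using PBKDF2-HMAC-SHA256 instead of a single hash call."""
    derived = hashlib.pbkdf2_hmac(
        "sha256", mac + ssid.encode("utf-8"), SALT, ITERATIONS
    )
    return derived.hex()
```

Each guess now costs an attacker ITERATIONS HMAC evaluations rather than one hash, which scales the brute-force cost linearly; legitimate clients pay the same cost once per AP they look up.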





 H1 = Hash(AP1.MAC + AP1.SSID)
 H2 = Hash(AP2.MAC + AP2.SSID)

Our private database's schema looks something like:

 Hash(AP1.MAC + AP1.SSID) == AP1.latitude, AP1.longitude, ...
 Hash(AP2.MAC + AP2.SSID) == AP2.latitude, AP2.longitude, ...
This is a pseudonymous data set, which can be problematic. (I would 
reduce the resolution of each entry so that we can have some k-anonymity 
here.) You could even cluster the locations.

Our published database would include two tables. The first table would map a
random row id to metadata about an anonymous access point:

 Random1 == AP1.latitude, AP1.longitude, ...
 Random2 == AP2.latitude, AP2.longitude, ...

The second table's primary key would be a hash of hashes. It would map a
hash of two neighboring access points' hash IDs to a row id of the first
table. Something like:

 Hash(H1 + H2) == Random1
 Hash(H2 + H1) == Random2

Someone querying the published database would need to know the MAC addresses
and current SSIDs of two neighboring access points to look up either's
location.
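
The two published tables described above can be sketched as follows. This is only an illustration of the scheme as stated in the message, assuming SHA-256 for Hash, raw 6-byte MACs, and "+" meaning byte concatenation; the function and variable names are hypothetical.

```python
import hashlib
import secrets


def h(data: bytes) -> bytes:
    """Stand-in for Hash(); SHA-256 is an assumption."""
    return hashlib.sha256(data).digest()


def build_published_tables(aps, neighbor_pairs):
    """aps: name -> (mac_bytes, ssid, lat, lon); neighbor_pairs: (name_a, name_b)."""
    ap_hash = {n: h(mac + ssid.encode("utf-8")) for n, (mac, ssid, _, _) in aps.items()}
    row_id = {n: secrets.token_hex(8) for n in aps}  # random row ids
    # Table 1: random row id -> location metadata.
    table1 = {row_id[n]: (lat, lon) for n, (_, _, lat, lon) in aps.items()}
    # Table 2: hash of two neighbors' hash IDs -> row id of the first one.
    table2 = {}
    for a, b in neighbor_pairs:
        table2[h(ap_hash[a] + ap_hash[b])] = row_id[a]  # Hash(H1 + H2) == Random1
        table2[h(ap_hash[b] + ap_hash[a])] = row_id[b]  # Hash(H2 + H1) == Random2
    return table1, table2


def lookup(table1, table2, target, neighbor):
    """target, neighbor: (mac_bytes, ssid). Returns target's (lat, lon) or None."""
    h_target = h(target[0] + target[1].encode("utf-8"))
    h_neighbor = h(neighbor[0] + neighbor[1].encode("utf-8"))
    row = table2.get(h(h_target + h_neighbor))
    return table1.get(row) if row is not None else None
```

Note that an AP with several neighbors ends up under several table-2 keys that all point at the same row id, so the published tables reveal which keys belong to the same AP.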


If this is published as specified, there are a couple of attacks I can 
think of now:
1. If you know, let's say, that org A has SSID Y and uses vendor Z (~18 bits 
of entropy per AP), you can search the table to find all of the locations 
of that org (~2^36 hashes). Given current speeds of ASIC hashing (~US$1.5K 
for 63e9 H/s ~= 2^36 H/s), you could do this in about 1 sec. (Penalty for 
using video cards instead of ASICs: 100x, so about two mins.) This assumes 
you are using plain SHA-1/SHA-256.


2. If you have a set of common AP SSIDs (say "fonera") and potential 
vendors for that system, you can test the closeness of any known 
location in the exposed list against ~2^32 potential MACs in less than one 
sec per known location. If you don't know the vendor, I think the number of 
tests would not be greater than 2^38, if you can discard part of the MAC 
address space. This again can be checked in a few secs.


3. From table 2 you can cluster the locations of closely located APs, and 
given table 1 you can actually learn the exact AP locations from the 
clusters.  You can then focus on the potential locations of interest.


So I think publishing table 2  as suggested is a bad idea.

I would start with the service first (with 3 AP locations required for 
high-res data) and not the public location store. I would be OK with 
only 1 AP location for data retrieval if we significantly reduce the 
resolution of the reply to no finer than one degree (at worst that is a 
delta of ~20 miles) and there is more than one AP in that area.
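
One way to read the single-AP rule above is sketched here: snap the reply to a one-degree grid cell and answer only if more than one AP is known in that cell. The grid-snapping approach and both function names are assumptions for illustration.

```python
import math


def coarsen(lat: float, lon: float, cell_degrees: float = 1.0):
    """Snap a coordinate to the south-west corner of its grid cell."""
    return (
        math.floor(lat / cell_degrees) * cell_degrees,
        math.floor(lon / cell_degrees) * cell_degrees,
    )


def single_ap_reply(lat: float, lon: float, aps_in_cell: int):
    """Reply to a one-AP query: coarse location, or None if the AP is alone."""
    if aps_in_cell <= 1:
        return None  # a lone AP in the cell would still be identifiable
    return coarsen(lat, lon)
```

For example, `coarsen(37.77, -122.42)` gives `(37.0, -123.0)`.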


Camilo


If you know the MAC+SSID of person X's personal access point and the
MAC+SSID of person Y's personal access point, then you can use this
database to ask the question "are person X and person Y in the same
location?" This seems bad. I see that you attempt to address this
below.


btw, should we use SHA-2 instead of SHA-1?

There is no reason to use SHA-1 when you have SHA-2 available.
However, as I indicated above, it isn't clear it is a good idea to be
using 

Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Chris Peterson

On 9/10/13 3:46 AM, Gervase Markham wrote:

Related question: it would be great if there were some way to lift this
restriction, at least for the web service if not for the database, while
preserving the necessary privacy protections. My family's house, which
is in a rural area, has a single access point; I want my phone to know
where it is immediately when I'm there. Not everywhere has lots of
access points.

One thought I had was to allow submission of the MCC/MNC (mobile network
IDs) as proof that you were nearby.


Our location service (and stumbler) also collects cell data, so we can 
geolocate with Wi-Fi AP and/or cell data.



chris



Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Hanno Schlichting
On 10.09.2013, at 03:46 , Gervase Markham g...@mozilla.org wrote:
 On 10/09/13 10:48, ianG wrote:
 If that is the case, why not flip it around.  Instead of trying to
 interpolate the existing data that is broadcast out there, why not write
 a protocol to broadcast the direct location from the wireless access point?
 
 Because only a tiny, tiny fraction of devices would run it, and for most
 of those, the user wouldn't have correctly set the device's location
 anyway, and for some of them, they'd have set it and then moved.
 
 This is a "boil the sea" approach to the problem.

In addition, the CDMA cell networks actually have support for reporting the base 
station's lat/lon as part of the protocol. But in practice these are almost 
never set, as cell operators value ease of deployment and uniform configuration 
more than providing this extra service.

In another anecdote, mobile operators cannot actually give you lists of all 
their cell towers and locations - we asked our partners. Thanks to a multitude 
of subsidiaries, subcontractors and partnerships, they often don't actually 
know how many cell towers they have and where they are. The same problem 
applies to the many wifi APs officially being operated by some large telco.

So even where this is possible, it's not actually a practically relevant 
approach.

Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Chris Peterson


On 9/10/13 11:53 AM, Stefan Arentz wrote:

I wonder if it makes sense to ban specific MAC address ranges (vendors) from 
appearing in this database. For example I think it would be possible to detect 
specific chipsets as being mobile devices vs stationary access points.


Our stumbler does some of this. MAC addresses encode whether a network 
is ad-hoc from another device or an infrastructure access point.


Wireshark maintains a list [1] of known vendor OUIs (MAC address 
prefixes), so we can filter out, say, HTC and Motorola MAC addresses. 
Filtering Apple's MAC addresses is trickier if we choose to collect 
desktop and laptop MAC addresses.


[1] https://anonsvn.wireshark.org/wireshark/trunk/manuf
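
A minimal sketch of MAC-based filtering, under two assumptions: that ad-hoc networks are recognised by the locally administered bit of the first octet (ad-hoc BSSIDs are generated rather than vendor-assigned), and that blocked vendors are matched on the three-octet OUI prefix. The blocked prefix below is a made-up placeholder, not a real HTC or Motorola OUI; real entries would come from a list such as the Wireshark one.

```python
# Hypothetical placeholder prefix; real OUIs would be loaded from a vendor list.
BLOCKED_OUIS = {"00:11:22"}


def is_locally_administered(mac: str) -> bool:
    """True if the universal/local bit (0x02) of the first octet is set."""
    first_octet = int(mac.split(":")[0], 16)
    return bool(first_octet & 0x02)


def oui(mac: str) -> str:
    """Return the vendor prefix (first three octets) in upper case."""
    return mac.upper()[:8]


def keep_mac(mac: str, blocked_ouis=BLOCKED_OUIS) -> bool:
    """Keep only vendor-assigned MACs whose vendor is not on the block list."""
    return not is_locally_administered(mac) and oui(mac) not in blocked_ouis
```

Unlike the SSID checks, this needs only the MAC address, so it could run on either the stumbler or the server.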


chris


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Hanno Schlichting
On 10.09.2013, at 03:39 , Gervase Markham g...@mozilla.org wrote:
 BTW, how does the service figure out the lat/long of an AP? Do we do
 anything at all with signal strengths? Could we?

This is a bit off-topic for the security discussion.

I suggest starting a new thread on dev-geolocation, if you want to know more 
about the technical details. The short answer is: Yes, but it's a lot more 
complicated than that :)

Cheers :)
Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Stefan Arentz

On Sep 9, 2013, at 9:13 PM, Brian Smith br...@briansmith.org wrote:

 On Mon, Sep 9, 2013 at 2:58 PM, Chris Peterson cpeter...@mozilla.com wrote:
 Google's Location Service prevents people from tracking individual access
 points by requiring requests to include at least 2-3 access points that
 Google knows are near each other. This proves the requester is near the
 access points.
 
 I assume "prevents people from tracking individual access points"
 means the following: Some people have a personal access point on them
 (e.g. in their phone). If somebody knows the SSID and MAC of this
 personal access point, then they could track this person's location by
 polling the database for that (SSID, MAC) pair. Google tries to limit
 this type of abuse as much as practical while still providing a
 location service based on such crowdsourced data.

I wonder if it makes sense to ban specific MAC address ranges (vendors) from 
appearing in this database. For example I think it would be possible to detect 
specific chipsets as being mobile devices vs stationary access points.

Also, when I tether my iPhone to my Mac, the Mac shows a different icon next to 
the network name. I think Android does the same. Maybe at a lower protocol 
level it is possible to see if an access point is a mobile device?

Is that worth investigating?

 S.



Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Daniel Veditz
On 9/10/2013 3:46 AM, Gervase Markham wrote:
 On 10/09/13 00:25, R. Jason Cronk wrote:
 Does this give Mozilla the
 ability to historically track me if I move my device? 
 
 Yes; this is why publishing the full raw stumbled data sets is sadly
 going to be not possible.

Why would we have two locations for the same AP? In fact, given the
schema Chris outlined (1:1 mapping H(Mac+SSID) = location) I don't see
how we even could.

-Dan Veditz





Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread Daniel Veditz
On 9/10/2013 10:09 AM, Hanno Schlichting wrote:
 As of this moment, we filter out any AP that has been detected in two
 different places (where different means more than ~1km away from each
 other). This is a very conservative approach and we'll relax that
 later.

What do you mean by "filtered out"? How are you tracking that it's now
been seen in multiple locations? Given the simple storage schema at the
top of the thread your choices seem limited to a) ignore the new
location info, or b) throw out the old location info. a) means no one
can ever move, and b) means the next time you see the new location that
becomes the location... over and over as it moves around.

That can't be right, so your database must be more complex. If you're
storing more than originally implied that may have some impact on a
security assessment.
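
For reference, the "~1km" rule quoted above could be checked along these lines. This sketch assumes the server keeps a list of observed (lat, lon) points per AP hash, which is exactly the extra state the question above is about; the haversine formula and the function names are illustrative choices, not the service's actual code.

```python
import math

EARTH_RADIUS_KM = 6371.0
MAX_SPREAD_KM = 1.0  # the "~1km" threshold from the quoted message


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def seen_in_two_places(observations, max_km=MAX_SPREAD_KM):
    """True if any two observations of the same AP are more than max_km apart."""
    return any(
        haversine_km(*a, *b) > max_km
        for i, a in enumerate(observations)
        for b in observations[i + 1:]
    )
```

The pairwise comparison is quadratic in the number of observations, which is fine for a sketch but would need a cheaper summary (for example a bounding box) at scale.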

-Dan Veditz





Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-10 Thread ianG

On 11/09/13 03:27 AM, Daniel Veditz wrote:

On 9/9/2013 11:21 PM, Chris Peterson wrote:

The primary motivation for hashing the MAC+SSID was to avoid uploading
the SSID (which is considered private data in some European countries)


"private" means we can't even /look/ at it, rather than merely can't
store it?



The data regime might be simply put as this:  you can't store a number 
suitable for tracking (or any derivative of it if that simply creates a 
new tracking number) unless you have a compelling business reason, and 
you have agreement.


The EU data protection regime makes a very strong distinction about any 
private tracking information.  It also goes to another level if you 
share that information with anyone.


The initial simple answer is, don't go there.  (I have no idea how 
google finessed this issue, or even if they didn't.)




I believe Europe also considers IP addresses private data, but
they certainly don't ban HTTP connections from giving up the IP address
to the server as part of a request.



That's because IP addresses have to be given up to the server as part of 
TCP.  A compelling case -- packets have to be returned somewhere. 
However, post-session storage is another issue, and data deletion 
practices should be in place.  Logging is where it gets vexatious.





iang


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-09 Thread R. Jason Cronk

I haven't done a full analysis but do have a few questions


On 9/9/2013 5:58 PM, Chris Peterson wrote:
Our private database maps access point hash IDs to locations (and 
other metadata). Assuming:


H1 = Hash(AP1.MAC + AP1.SSID)
H2 = Hash(AP2.MAC + AP2.SSID)


I assume + means concatenate. I might suggest XORing the values. SSID 
names are usually human readable, not meant to be secure and thus follow 
predictable patterns. I also hope you're not using the patterned MAC 
notation but rather the 48 bit address space representation.





Our private database's schema looks something like:

Hash(AP1.MAC + AP1.SSID) == AP1.latitude, AP1.longitude, ...
Hash(AP2.MAC + AP2.SSID) == AP2.latitude, AP2.longitude, ...


Is the data aged? What happens if I move? Does this give Mozilla the 
ability to historically track me if I move my device? Is that a problem? 
(I'm not saying it is, just an observation).
You mention below about filtering APs in multiple locations but clearly 
they can move as people relocate.

What is the granularity of the lat/long?



Our published database would include two tables. The first table would 
map a random row id to metadata about an anonymous access point:


Random1 == AP1.latitude, AP1.longitude, ...
Random2 == AP2.latitude, AP2.longitude, ...


I would be hesitant to use the word anonymous here. Lat/long is easily 
combined with other publicly available databases that could identify 
individual addresses and thus individuals. Again, it comes down to the 
granularity of the data.




The second table's primary key would be a hash of hashes. It would map 
a hash of two neighboring access points' hash IDs to a row id of the 
first table. Something like:


Hash(H1 + H2) == Random1
Hash(H2 + H1) == Random2

Someone querying the published database would need to know the MAC 
addresses and current SSIDs of two neighboring access points to look 
up either's location.


When you say published, do you mean that the entire DB is published for 
use by researchers or that it just has a publicly exposed API that 
responds to queries?
I'm assuming if AP3 through AP10 were all also in the vicinity that 
Hash(H1+Hx) == Random1 where x is in {2,..,10}, correct?
If so, will whichever value Hy is the prefix in the concatenation 
correspond to APy's Random id?






btw, should we use SHA-2 instead of SHA-1? In 2009, NIST recommended 
that Federal agencies should stop using SHA-1 for applications that 
require collision resistance as soon as practical, and must use the 
SHA-2 family of hash functions for these applications after 2010.


Yes


*R. Jason Cronk, Esq., CIPP/US*
/Privacy Engineering Consultant/, *Enterprivacy Consulting Group* 
enterprivacy.com


 * phone: (828) 4RJCESQ
 * twitter: @privacymaverick.com
 * blog: http://blog.privacymaverick.com



Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-09 Thread Brian Smith
On Mon, Sep 9, 2013 at 2:58 PM, Chris Peterson cpeter...@mozilla.com wrote:
 Google's Location Service prevents people from tracking individual access
 points by requiring requests to include at least 2-3 access points that
 Google knows are near each other. This proves the requester is near the
 access points.

I assume "prevents people from tracking individual access points"
means the following: Some people have a personal access point on them
(e.g. in their phone). If somebody knows the SSID and MAC of this
personal access point, then they could track this person's location by
polling the database for that (SSID, MAC) pair. Google tries to limit
this type of abuse as much as practical while still providing a
location service based on such crowdsourced data.

 Unlike Google's Location Service, our server does not store MAC addresses or
 SSIDs. We identify access points by hash IDs, specifically SHA1(MAC+SSID).
 To query the location of an access point in the database, you must know both
 its MAC address and current SSID.

MAC addresses are 48 bits. SSIDs are often guessable or predictable.
Therefore, using the H(MAC+SSID) instead of just the plain MAC+SSID is
not buying you much in terms of privacy, IMO. Basically, if you are
really trying to use this as a privacy mechanism then you should store
the MAC+SSID according to best practices for storing passwords. For
example, use PBKDF2 with a large number of iterations. Regardless of
whether you use SHA1, SHA2, PBKDF2, or something else, I will still
call whatever function you use H(x). But, I am not sure that switching
to PBKDF2 even buys you much improved privacy protection.
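
A minimal sketch of the point above, assuming Python's standard hashlib.
The salt, the iteration count, and the helper names (plain_id, slow_id,
guess) are illustrative assumptions and not part of the proposed scheme.
A lookup key has to be reproducible by every client, so the salt here is
fixed and public, which already weakens the password-storage analogy.

```python
import hashlib
from itertools import product

SALT = b"example-fixed-salt"  # hypothetical; lookups need a deterministic salt
ITERATIONS = 100_000          # hypothetical work factor


def plain_id(mac: bytes, ssid: str) -> bytes:
    """Single-pass hash of MAC + SSID (SHA-256 used for illustration)."""
    return hashlib.sha256(mac + ssid.encode("utf-8")).digest()


def slow_id(mac: bytes, ssid: str, iterations: int = ITERATIONS) -> bytes:
    """Same input, stretched with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", mac + ssid.encode("utf-8"), SALT, iterations)


def guess(target: bytes, oui: bytes, suffixes, ssids, id_fn=plain_id):
    """Dictionary attack: try every (vendor prefix + suffix, SSID) candidate."""
    for suffix, ssid in product(suffixes, ssids):
        mac = oui + suffix.to_bytes(3, "big")
        if id_fn(mac, ssid) == target:
            return mac, ssid
    return None
```

With a known 24-bit vendor prefix, range(2**24) covers every remaining MAC
suffix, roughly 16.7 million candidates per guessed SSID. PBKDF2 multiplies
the cost of each candidate by the iteration count, but it does so for
legitimate clients as well as for the attacker.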

 H1 = Hash(AP1.MAC + AP1.SSID)
 H2 = Hash(AP2.MAC + AP2.SSID)

 Our private database's schema looks something like:

 Hash(AP1.MAC + AP1.SSID) == AP1.latitude, AP1.longitude, ...
 Hash(AP2.MAC + AP2.SSID) == AP2.latitude, AP2.longitude, ...

 Our published database would include two tables. The first table would map a
 random row id to metadata about an anonymous access point:

 Random1 == AP1.latitude, AP1.longitude, ...
 Random2 == AP2.latitude, AP2.longitude, ...

 The second table's primary key would be a hash of hashes. It would map a
 hash of two neighboring access points' hash IDs to a row id of the first
 table. Something like:

 Hash(H1 + H2) == Random1
 Hash(H2 + H1) == Random2

 Someone querying the published database would need to know the MAC addresses
 and current SSIDs of two neighboring access points to look up either's
 location.

If you know the MAC+SSID of person X's personal access point and the
MAC+SSID of person Y's personal access point, then you can use this
database to ask the question "are person X and person Y in the same
location?" This seems bad. I see that you attempt to address this
below.

 btw, should we use SHA-2 instead of SHA-1?

There is no reason to use SHA-1 when you have SHA-2 available.
However, as I indicated above, it isn't clear it is a good idea to be
using any plain hash function as H(x).

 Other layers of privacy protection include filtering out ad-hoc Wi-Fi
 networks; MAC addresses with vendor prefixes from mobile device manufacturers
 (e.g. Apple and HTC); SSIDs commonly associated with mobile devices (e.g.
 XXX's iPhone and Google's _nomap opt-out); and APs reported in multiple
 locations.

I think that these things are much more important than the protection
offered by H(x). My concern is that if you store the data on the
server as H(x) then you will not be able to do the above filtering on
the server unless H(x) is ineffective. That seems bad, because the
server will be much easier to update to improve the filtering than the
clients will be, AFAICT. Also, you will not be able to measure the
effectiveness of the privacy protections on the server, which is also
very bad.

Therefore, I'd suggest that you avoid using any protection at all, and
just use x instead of H(x) until we are very confident there is no way
we can further improve the filtering.
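
A rough sketch of why the raw values matter for server-side filtering,
written in Python. Every rule here is an assumption for illustration: the
vendor prefix set is a placeholder and not a real list of mobile vendors,
the SSID hints are made up, the opt-out is treated as an SSID ending in
"_nomap", and the locally administered bit is used only as a crude proxy
for ad-hoc networks. None of these checks can run against H(x).

```python
# Placeholder values for illustration only, not real filter lists.
MOBILE_OUIS = {bytes.fromhex("001122")}
MOBILE_SSID_HINTS = ("iphone", "android")


def is_locally_administered(mac: bytes) -> bool:
    """True if the locally administered bit of the first octet is set."""
    return bool(mac[0] & 0x02)


def keep_observation(mac: bytes, ssid: str) -> bool:
    """Decide whether a raw (MAC, SSID) observation may enter the database."""
    if ssid.endswith("_nomap"):       # opt-out suffix
        return False
    if is_locally_administered(mac):  # crude proxy for ad-hoc networks
        return False
    if mac[:3] in MOBILE_OUIS:        # vendor prefix of a mobile device maker
        return False
    lowered = ssid.lower()
    return not any(hint in lowered for hint in MOBILE_SSID_HINTS)
```

Because the rules live on the server, tightening them only needs a server
deployment and a re-scan of the stored raw values.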

Cheers,
Brian Smith
-- 
Mozilla Networking/Crypto/Security (Necko/NSS/PSM), NSA plant


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-09 Thread Eric Rescorla
Chris,

I have some basic and perhaps stupid questions.

1. How do I bootstrap? I turn on my device and want to get the coordinates of 
the APs I see. That requires a lat/long for neighbors. What now?

2. As asked previously, will the db be published or queryable?

3. What is the lat/long resolution? How is it measured?

Thanks
Ekr

On Sep 9, 2013, at 14:58, Chris Peterson cpeter...@mozilla.com wrote:

 I'm looking for some feedback on crypto privacy protections for a geolocation 
 research project I'm working on with the Mozilla Services team. If you have 
 general questions or suggestions about the project, I'm happy to answer them, 
 but I'd like to focus this thread on crypto.
 
 Our team is prototyping a crowd-sourced version of Google's Street View cars 
 to correlate Wi-Fi access points and cell towers to GPS positions. Our 
 primary motivation is to provide non-proprietary location services for 
 Firefox OS devices. We would also like to publish this location data for 
 researchers or other projects that might have novel uses for it.
 
 Google's Location Service prevents people from tracking individual access 
 points by requiring requests to include at least 2-3 access points that 
 Google knows are near each other. This proves the requester is near the 
 access points.
 
 Below is a sketch of a scheme that I think will allow us to publish a 
 database of access point locations while still requiring knowledge of two 
 neighboring access points.
 
 Unlike Google's Location Service, our server does not store MAC addresses or 
 SSIDs. We identify access points by hash IDs, specifically SHA1(MAC+SSID). To 
 query the location of an access point in the database, you must know both its 
 MAC address and current SSID.
 
 Our private database maps access point hash IDs to locations (and other 
 metadata). Assuming:
 
H1 = Hash(AP1.MAC + AP1.SSID)
H2 = Hash(AP2.MAC + AP2.SSID)
 
 Our private database's schema looks something like:
 
Hash(AP1.MAC + AP1.SSID) == AP1.latitude, AP1.longitude, ...
Hash(AP2.MAC + AP2.SSID) == AP2.latitude, AP2.longitude, ...
 
 Our published database would include two tables. The first table would map a 
 random row id to metadata about an anonymous access point:
 
Random1 == AP1.latitude, AP1.longitude, ...
Random2 == AP2.latitude, AP2.longitude, ...
 
 The second table's primary key would be a hash of hashes. It would map a hash 
 of two neighboring access points' hash IDs to a row id of the first table. 
 Something like:
 
Hash(H1 + H2) == Random1
Hash(H2 + H1) == Random2
 
 Someone querying the published database would need to know the MAC addresses 
 and current SSIDs of two neighboring access points to look up either's 
 location.
 
 btw, should we use SHA-2 instead of SHA-1? In 2009, NIST recommended that 
 Federal agencies should stop using SHA-1 for applications that require 
 collision resistance as soon as practical, and must use the SHA-2 family of 
 hash functions for these applications after 2010.
 
 Other layers of privacy protection include filtering out ad-hoc Wi-Fi 
 networks; MAC addresses with vendor prefixes from mobile device manufacturers 
 (e.g. Apple and HTC); SSIDs commonly associated with mobile devices (e.g. 
 XXX's iPhone and Google's _nomap opt-out); and APs reported in multiple 
 locations.
 
 
 thanks,
 chris


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-09 Thread Hanno Schlichting
On 09.09.2013, at 18:41 , Eric Rescorla e...@rtfm.com wrote:
 1. How do I bootstrap? I turn on my device and want to get the coordinates of 
 the APs I see. That requires a lat/long for neighbors. What now?

We build the database by having people use a stumbler application to send us 
observations. The stumbler app uses the mobile phone's GPS sensor to know its 
location. It reports all cell towers and wifi APs it sees to us in a certain 
location. We crunch some data, then we make a search API available over this 
data. Later someone else asks us what their location is, based on seeing cell 
towers or APs.
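
The flow above could be sketched roughly as follows in Python. The thread
does not say how the data is crunched, so the plain averaging used here is
an assumption, as are the function names and the record layout. Averaging
raw degrees also ignores wrap-around at the 180th meridian.

```python
from collections import defaultdict


def estimate_ap_locations(observations):
    """observations: iterable of (ap_key, lat, lon) taken from stumbler reports.

    Returns {ap_key: (mean_lat, mean_lon)} as a naive position estimate.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for ap_key, lat, lon in observations:
        entry = sums[ap_key]
        entry[0] += lat
        entry[1] += lon
        entry[2] += 1
    return {key: (lat / n, lon / n) for key, (lat, lon, n) in sums.items()}


def locate(estimates, visible_keys):
    """Answer a search request: average the known positions of the visible APs."""
    known = [estimates[key] for key in visible_keys if key in estimates]
    if not known:
        return None
    lat = sum(p[0] for p in known) / len(known)
    lon = sum(p[1] for p in known) / len(known)
    return lat, lon
```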

 2. As asked previously, will the db be published or queryable?

It will definitely be queryable, but with a lot of restrictions to enhance 
privacy. We would like to publish it or as much of it as possible, but it's 
unclear how to do that, when a lot of the individual records are considered 
personally identifiable information.

 3. What is the lat/long resolution? How is it measured?

The resolution differs, but is generally as precise as it gets. GPS 
sensors often have 5 meter precision, and Google aims for 1 meter resolution for 
indoor locations based on Wifi access points. Internally we currently store 
things with centimeter precision and timestamps in milliseconds - so definitely 
all on the far side of extremely detailed / private.

Hanno


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-09 Thread Brian Smith
On Mon, Sep 9, 2013 at 7:15 PM, Hanno Schlichting
hschlicht...@mozilla.com wrote:
 On 09.09.2013, at 18:13 , Brian Smith br...@briansmith.org wrote:
 On Mon, Sep 9, 2013 at 2:58 PM, Chris Peterson cpeter...@mozilla.com wrote:

 [T]here's one crucial difference between Google and us: We would
 like to make as much of this data public as possible, while Google will always
 just provide a service without access to the underlying data.

 We were looking for two things with using the sha1:

 - Make it possible for the end-user to change their unique value (they cannot 
 change the mac address, but they can change the ssid). This allows them to 
 invalidate historical records in the database.

There is friction in changing SSIDs as it affects every device that
would connect to that network. There will also probably not be much
awareness among users of when/why/how to do this or what effect it
will have. So, I think this is an aspect that sounds great in theory,
but in practice will nearly never be used.

 - Make it harder for spammers to guess actual unique keys and flood our 
 service. Mac addresses have a vendor prefix, which makes it rather easy to 
 generate lots of valid mac addresses. Taking the ssid into account makes it 
 harder to generate valid keys. Unfortunately the ssid itself is considered 
 private data in European countries, so you aren't allowed to store it without 
 the user's consent. That's why Google and everyone else have stopped storing 
 them and only use mac addresses now.

 The sha1 scheme might be ineffective in doing this.

If x is private data then SHA1(x), SHA2(x), PBKDF2(x), and even
AES256(x, key) with a key known to you are all private data too.

 Therefore, I'd suggest that you avoid using any protection at all, and
 just use x instead of H(x) until we are very confident there is no way
 we can further improve the filtering.

 This sounds like good advice and I'm starting to lean in this direction.

 But this only helps us on the "we provide a service" side. It's still unclear 
 to me if and how we could share any of this data as database dumps.

If you wanted to publish this data, and the data was stored in its raw
state, then you could always apply whatever mapping (SHA2, PBKDF2,
AES256 with random and thrown-away key, etc.) right before you share
the data.

Even if you use AES256 with a random, thrown-away key, the data will
be subject to reverse engineering. For example, one could correlate a
subset of the data with a separate database of known
(MAC,SSID,Location) triples, and/or attempt traffic analysis to see
relationships in how (MAC,SSID) pairs interact with each other with
respect to location. You have probably heard of the Netflix Prize
privacy issues [1]; this is a very similar problem to the Netflix
prize. Therefore, while it may be important to obscure the data before
giving it to researchers, we should still consider the obscured data
to be highly-sensitive confidential user data.
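
A small Python sketch of the linkage problem described above. HMAC-SHA256
with a random key that is discarded after use stands in for "AES256 with a
random, thrown-away key", because it needs only the standard library. The
record layout, function names, and tolerance are assumptions for
illustration.

```python
import hashlib
import hmac
import secrets


def pseudonymize(records):
    """records: iterable of (mac, ssid, lat, lon); mac is bytes, ssid is str.

    Each (MAC, SSID) pair is replaced by a keyed tag. The key exists only
    inside this call and is never stored or returned.
    """
    key = secrets.token_bytes(32)
    published = []
    for mac, ssid, lat, lon in records:
        tag = hmac.new(key, mac + ssid.encode("utf-8"), hashlib.sha256).hexdigest()
        published.append((tag, lat, lon))
    return published


def link(published, known, tolerance=1e-4):
    """Re-identify published rows by matching locations against known triples."""
    matches = {}
    for tag, lat, lon in published:
        for mac, ssid, known_lat, known_lon in known:
            if abs(lat - known_lat) <= tolerance and abs(lon - known_lon) <= tolerance:
                matches[tag] = (mac, ssid)
    return matches
```

Even though the key is gone, anyone holding a handful of known
(MAC, SSID, location) triples can map tags back to access points, because
the location column itself acts as the join key.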

[1] http://en.wikipedia.org/wiki/Netflix_Prize#Privacy_concerns

Cheers,
Brian
-- 
Mozilla Networking/Crypto/Security (Necko/NSS/PSM)


Re: Request for feedback on crypto privacy protections of geolocation data

2013-09-09 Thread Chris Peterson

On 9/9/13 4:25 PM, R. Jason Cronk wrote:

On 9/9/2013 5:58 PM, Chris Peterson wrote:

Our private database maps access point hash IDs to locations (and
other metadata). Assuming:

H1 = Hash(AP1.MAC + AP1.SSID)
H2 = Hash(AP2.MAC + AP2.SSID)


I assume + means concatenate. I might suggest XORing the values. SSID
names are usually human readable, not meant to be secure and thus follow
predictable patterns. I also hope you're not using the patterned MAC
notation but rather the 48 bit address space representation.


We currently use concatenation, but I see how XOR would make more sense. 
We are using the SSID as a weak protection against someone polluting 
our database results by submitting random MAC addresses. Our database 
still might have their junk data, but real location requests shouldn't 
hit them.


We are using the MAC string notation like 45:67:89:ab:cd:ef, but I see 
that this format has predictable patterns, too. I will recommend we use 
the 48-bit binary representation.
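
A minimal sketch of that normalization in Python, assuming SHA-256 and
UTF-8 SSIDs; the function names are hypothetical. It keeps concatenation
rather than XOR, since the MAC is always 6 bytes while the SSID length
varies, and XOR would need an extra rule for operands of unequal length.

```python
import hashlib


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' (or '-' separated) to the raw 6-byte form."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError(f"expected a 48-bit MAC address, got {mac!r}")
    return raw


def ap_hash_id(mac: str, ssid: str) -> str:
    """Hash(MAC + SSID), with '+' read as concatenation of raw bytes."""
    data = mac_to_bytes(mac) + ssid.encode("utf-8")
    return hashlib.sha256(data).hexdigest()
```

Putting the fixed-length MAC first keeps the concatenation unambiguous, and
parsing the string form means upper case, lower case, and different
separators all produce the same ID.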




What is the granularity of the lat/long?


This depends on the GPS of the device used to collect the data, but our 
database stores 7 decimal places (less than one meter resolution).
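
To put a number on the granularity, a back-of-the-envelope Python sketch,
assuming a spherical Earth with a mean radius of 6,371 km (the function
name is hypothetical):

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation


def meters_per_step(decimals: int, latitude_deg: float = 0.0):
    """Ground distance covered by one unit in the last stored decimal place.

    Returns (north_south_m, east_west_m) at the given latitude.
    """
    step_rad = math.radians(10 ** -decimals)
    north_south = EARTH_RADIUS_M * step_rad
    east_west = north_south * math.cos(math.radians(latitude_deg))
    return north_south, east_west
```

Under that approximation, one step in the seventh decimal place is about
1.1 cm north-south, and the east-west step shrinks with the cosine of the
latitude.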




Someone querying the published database would need to know the MAC
addresses and current SSIDs of two neighboring access points to look
up either's location.


When you say published, do you mean that the entire DB is published for
use by researchers or that it just has a publicly exposed API that
responds to queries?


We are investigating both a web service API and a downloadable database. 
We are collecting position data for both Wi-Fi access points and cell 
towers. Depending on privacy protections, if we can't publish the whole 
database to the world, we can publish just the cell tower data to the 
world and possibly make the Wi-Fi data available only to trusted 
researchers.




I'm assuming if AP3 through AP10 were all also in the vicinity that
Hash(H1+Hx) == Random1 where x is in {2,..,10}, correct?
If so, will whichever value Hy is the prefix in the concatenation
correspond to APy's Random id?


In the proposed scheme, yes. Since AP1 and AP2 have different (but 
close) latitude and longitude positions, Hash(H1+H2) would fetch the 
random row id for AP1's location and Hash(H2+H1) would fetch the row id 
for AP2's location.
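
The two published tables and the ordered-pair lookup could be sketched like
this in Python. SHA-256, the 128-bit random row ids, and all function and
variable names are assumptions for illustration; the thread only fixes the
overall shape Hash(H1 + H2) == Random1.

```python
import hashlib
import secrets


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def ap_hash(mac: bytes, ssid: str) -> bytes:
    """H = Hash(MAC + SSID)."""
    return h(mac + ssid.encode("utf-8"))


def build_published_tables(aps, neighbors):
    """aps: {name: (mac, ssid, lat, lon)}; neighbors: iterable of (a, b) names.

    Returns (locations, pair_index):
      locations:  random row id -> (lat, lon)
      pair_index: Hash(Ha + Hb) -> row id of a, Hash(Hb + Ha) -> row id of b
    """
    row_ids = {name: secrets.token_hex(16) for name in aps}
    locations = {row_ids[name]: (lat, lon) for name, (_, _, lat, lon) in aps.items()}
    pair_index = {}
    for a, b in neighbors:
        ha = ap_hash(*aps[a][:2])
        hb = ap_hash(*aps[b][:2])
        pair_index[h(ha + hb)] = row_ids[a]
        pair_index[h(hb + ha)] = row_ids[b]
    return locations, pair_index


def lookup(locations, pair_index, first, second):
    """Location of `first`, given (mac, ssid) tuples for two neighboring APs."""
    row = pair_index.get(h(ap_hash(*first) + ap_hash(*second)))
    return locations.get(row) if row is not None else None
```

The number of pair_index rows grows with the number of neighbor pairs, two
rows per pair, so a dense cluster of N mutually visible access points
contributes on the order of N*(N-1) rows.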



chris