Oh :) I forgot that I need to use the context menu to reply to the list in Thunderbird, so I sent my last reply in this discussion only to Michael's email address.
This is what he replied to me:

On 13.07.2014 15:43, Michael von Glasow wrote:
Hi Felix,

Is it by intention that your reply went to me personally but not to the list? If not, feel free to forward my reply to the list.

On 11/07/14 19:18, Felix Baumann wrote:
If we use hashes, then we could reduce the time needed to check all of them by filtering out inappropriate hashes beforehand. Only entries in the same time zone, country or in range of the device's IP would be checked then.
Does that mean the downloadable database should include information such as country or time zone? Also, I don't quite understand what you mean by "in range of the device's IP". WiFi-based geolocation uses the BSSID, i.e. the MAC address, of access points, which is one layer lower than IP. In most cases we never obtain any IP addresses from these access points, as most of them are secured – we can see and identify them but not connect to them, but if we know their coordinates, we can use their identification data for geolocation.

While I would sort out details on optimizing the performance of a database search later and for now focus on the question of how to provide a database download without sacrificing privacy, I believe looking up a hash in a database is a fairly straightforward operation, and the extra cost of filtering data would largely cancel out the benefits of a quicker lookup.

The principle of a hash, as previously discussed here, is to prevent lookups based on the BSSID alone. Records are not identified by the plain BSSID but by a salted hash of the BSSID. The salt would be a piece of extra information that is easy to obtain when one is in the vicinity of the BSSID but hard to guess otherwise. Therefore, if I want to look up my position and have that extra information, I can easily calculate the hashes of nearby BSSIDs and look up their position. However, if I want to stalk someone and want to use their BSSID to determine where they have moved, I would need to guess all possible hashes for their BSSID. Our design goal is to make that guesswork impractical.
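
To make the salted-hash idea concrete, here is a minimal sketch. It is only an illustration under assumptions: the thread has not settled on a hash function or a salt format, so SHA-256, the `record_key` and `lookup` helpers, the salt string and the sample coordinates are all made up for this example.

```python
import hashlib

def record_key(bssid: str, salt: str) -> str:
    """Return the hex key a WiFi record could be stored under.

    Assumption: SHA-256 over "salt|bssid"; the actual scheme is undecided.
    """
    normalized = bssid.lower().replace("-", ":")
    return hashlib.sha256(f"{salt}|{normalized}".encode("utf-8")).hexdigest()

# Hypothetical downloadable database: salted hash -> (lat, lon).
# The plain BSSID never appears in it.
database = {
    record_key("00:11:22:33:44:55", "example-nearby-salt"): (48.137, 11.575),
}

def lookup(bssid: str, salt: str):
    """Look up a position; returns None unless the salt matches."""
    return database.get(record_key(bssid, salt))

print(lookup("00:11:22:33:44:55", "example-nearby-salt"))  # position found
print(lookup("00:11:22:33:44:55", "some-other-salt"))      # None
```

A legitimate client that observes the salt locally needs one hash computation per BSSID, while someone who knows only the BSSID has to try every candidate salt.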

Now if we had a way to pre-filter data, I would worry that this would reduce the number of records which a malicious user would have to search, thus making their life easier, without providing much of an improvement for legitimate uses.

Taking paranoia one step further, I would even start wondering if any possibility to filter the database based on coordinates could be useful for rogue users. A stalker might be able to narrow down the area in which their would-be victim is likely to be found to a country, state or even just a metropolitan area, then filter the database for the relevant records and be able to operate on a much smaller set of data.

Here Sam's proposal would really come in handy, as it gives out lat/lon individually but never in pairs, thus making that kind of filtering less effective. Filtering data can only be done by lat/lon boundaries, ruling out more sophisticated constructs such as polygons, and filtered data would still contain a lot of records whose latitude is inside the target area but the longitude is not, or vice versa. Another option would be to use not only salted hashes, but also encrypted coordinates (the salt for the hash and the encryption key could be derived from the same information).
I'm not sure whether we need to hash cells but if not we could use the nearby cells to get an even more accurate position.
I don't think we need to hash cells – they contain no private data, and the general consensus seems to be that cells and their locations can be given out in plain.

I had considered using nearby cells for the salt – there are situations in which there is just one WiFi in range, but in most cases there will be a cellular connection. There are, however, two issues with cells:

Many devices on the market don't expose all cells in range through their API, so a geolocation service running on the device might only be able to obtain the currently serving cell. That means we need to keep multiple hashes for each WiFi – one for each cell whose range it touches.

Most locations, especially in urban areas, are in the range of multiple cells. At my home, I frequently get handed over between three different cells of my carrier – and these are just the 3G cells. There are also 2G cells and 4G cells in range. And, finally, the area in which I live is served by four different carriers (a huge share of countries have somewhere between 2 and 4 carriers). That means my home WiFi would need 36 different hashes (3 cells × 3 standards × 4 carriers), which would bloat the database. Even a more conservative estimate (2 cells, 2 standards, 1 carrier) would still require four different hashes for one BSSID.
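
The record count above can be reproduced with a short sketch. Again this is only an illustration: the cell identifiers, the `record_key` helper and the use of SHA-256 are assumptions, not an agreed format.

```python
import hashlib
from itertools import product

def record_key(bssid: str, salt: str) -> str:
    # Assumption: SHA-256 over "salt|bssid"; the real scheme is undecided.
    return hashlib.sha256(f"{salt}|{bssid.lower()}".encode("utf-8")).hexdigest()

def keys_for_wifi(bssid: str, cell_ids) -> set:
    """One stored key per cell whose range touches the WiFi."""
    return {record_key(bssid, cell_id) for cell_id in cell_ids}

def covering_cells(carriers: int, standards: int, cells: int) -> list:
    # Hypothetical identifiers, one per (carrier, standard, cell) combination.
    return [f"carrier{c}/std{s}/cell{n}"
            for c, s, n in product(range(carriers), range(standards), range(cells))]

bssid = "00:11:22:33:44:55"
keys = keys_for_wifi(bssid, covering_cells(carriers=4, standards=3, cells=3))
conservative = keys_for_wifi(bssid, covering_cells(carriers=1, standards=2, cells=2))

print(len(keys))          # 36
print(len(conservative))  # 4
```

Since a device may only report its serving cell, each of these keys would have to be stored as a separate record pointing at the same coordinates, which is where the database growth comes from.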


Some new questions of mine:

Using cells to hash WiFis could be a real issue, but it would be a safe method (you would need at least one WiFi and one cell). But what about two WiFi APs and no cell? Is that too rare, or do we need to use WiFis to hash other WiFis, too?

Is it possible to compress such a hash database?

How often would such a database download be updated? (once per week?)

How accurate do we want to make the lat/lon coordinates: street, city, ..., country level, or as accurate as possible?

Regards,
Felix
_______________________________________________
dev-geolocation mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-geolocation
