On 02.09.2014, at 19:52 , Chris Peterson <[email protected]> wrote:
> Why hourly diffs? Daily updates seems adequate. Are the hourly updates diffs 
> from the most recent full update or the previous hourly update?

Hourly updates are a compromise we reached with OpenCellID. They preferred a 
live streaming API they could subscribe to and get more or less instant 
updates. The primary motivation here was to avoid spikes of new incoming data 
and, with them, the need to stand up hardware that can handle the peak loads.

Since such a streaming / subscribe API would have been much more difficult to 
implement on our side, we settled on hourly differential updates as a small 
enough unit of work.

Each diff update covers exactly one hour. So a file named …T160000.csv.gz 
includes all cells modified between 15:00 and 16:00 UTC (excluding the exact 
minute of 16:00).
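
To make the window concrete, here is a minimal sketch of the half-open interval described above. The function names diff_window and in_window are made up for illustration and are not part of our export code; the sketch takes the end time as a datetime rather than parsing it from the file name.

```python
from datetime import datetime, timedelta, timezone


def diff_window(end):
    """Return the (start, end) window an hourly diff ending at `end` covers."""
    # Hypothetical helper: the window starts exactly one hour before `end`.
    return end - timedelta(hours=1), end


def in_window(modified, end):
    """True if a cell modified at `modified` belongs in the diff ending at `end`."""
    start, stop = diff_window(end)
    # Half-open interval: the start is included, the end itself is excluded.
    return start <= modified < stop
```

With this, a cell modified at 15:59:59 UTC lands in the T160000 file, while one modified at exactly 16:00:00 lands in the next one.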

The full exports don’t have a time restriction, so they include anything up to 
the point the export file was created, which might be slightly more data than 
what was created “yesterday”. In practice you can take the full export file 
with a 2014-09-03 date and apply the hourly diff files from September 3rd on 
top of it to reach the current state. The files are purely additive and new 
rows always overwrite old rows.
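
As a rough sketch of how a consumer might apply the files, assuming each file is a gzipped CSV with a header row and that some set of columns identifies a cell: the function names and the key_fields parameter below are hypothetical, and the actual key columns depend on the export format, which I am not spelling out here.

```python
import csv
import gzip


def read_rows(path):
    """Yield each row of a gzipped CSV file as a dict (assumes a header row)."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle)


def apply_rows(state, rows, key_fields):
    """Merge rows into state; a later row replaces an earlier one with the same key."""
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        state[key] = row
    return state


def rebuild(full_path, diff_paths, key_fields):
    """Load the full export, then apply the diffs in the order given."""
    state = apply_rows({}, read_rows(full_path), key_fields)
    for path in diff_paths:
        apply_rows(state, read_rows(path), key_fields)
    return state
```

The diff files need to be passed in chronological order, since the last row seen for a given key wins.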

> Why are the updates merely gzipped? xz compression has much better results:
> 
> MLS-full.csv   135M
> MLS-full.csv.gz         35M
> MLS-full.csv.bz2  32M
> MLS-full.csv.xz   26M

OpenCellID already uses gzip compression for all their data, so we kept it. 
Gzip support is also much more commonly available; for example, it is built 
into the version of Python we are using. I imagine the same applies to the 
version of Java OpenCellID uses.

> I'm surprised there is so little overlap, but that's good news for both 
> projects. Are most of the OpenCellID networks in Europe? I look forward to 
> seeing MLS' update coverage map. :)

OpenCellID has a number of different statistics and visualizations. For 
example, there are per-country stats at 
http://opencellid.org/#action=statistics.cells&type=1&dateFrom=&dateTo=&mcc=&mnc=&sortBy=1
similar to what we have at 
https://location.services.mozilla.com/stats/countries

Hanno
_______________________________________________
dev-geolocation mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-geolocation
