On 02.09.2014, at 19:52, Chris Peterson <[email protected]> wrote:

> Why hourly diffs? Daily updates seems adequate. Are the hourly updates diffs
> from the most recent full update or the previous hourly update?

Hourly updates are a compromise we reached with OpenCellID. They preferred a live streaming API they could subscribe to, which would give them more or less instant updates. Their primary motivation was to avoid spikes of incoming data, so that they would not have to stand up hardware just to handle peak loads. Since such a streaming / subscribe API would have been much more difficult to implement on our side, we settled on hourly differential updates as a small enough unit of work.

The diff updates cover exactly one hour. So a file named …T160000.csv.gz includes all cells modified between 15:00 and 16:00 UTC (excluding the exact minute of 16:00). The full exports don’t have a time restriction, so they include anything up to the point the export file was created, which might be slightly more data than what was created “yesterday”.

Practically, you can take the full export file with a 2014-09-03 date and apply the hourly diff files from September 3rd on top of it to reach the current state. The files are purely additive, and new rows always overwrite old rows.

> Why are the updates merely gzipped? xz compression has much better results:
>
> MLS-full.csv 135M
> MLS-full.csv.gz 35M
> MLS-full.csv.bz2 32M
> MLS-full.csv.xz 26M

OpenCellID already uses gzip compression for all their data, so we kept it. Gzip support is also much more commonly available; for example, it is built into the version of Python we are using. I imagine the same applies to the version of Java OpenCellID uses.

> I'm surprised there is so little overlap, but that's good news for both
> projects. Are most of the OpenCellID networks in Europe? I look forward to
> seeing MLS' update coverage map. :)

OpenCellID has a number of different statistics and visualizations.
For example, there are per-country stats at
http://opencellid.org/#action=statistics.cells&type=1&dateFrom=&dateTo=&mcc=&mnc=&sortBy=1
similar to what we have at
https://location.services.mozilla.com/stats/countries

Hanno

_______________________________________________
dev-geolocation mailing list
[email protected]
https://lists.mozilla.org/listinfo/dev-geolocation
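
For anyone consuming the files: the procedure described in the message (start from a full export, then apply the hourly diffs in order, with newer rows overwriting older ones) could be sketched roughly as follows. This is only an illustration, not the code either project runs. It assumes each file is a gzipped CSV with a header line, and that a cell is identified by the columns named in KEY_FIELDS; the message states neither, and the function names are made up for this example.

```python
import csv
import gzip

# Assumption: these columns together identify one cell. The message does not
# list the key columns, so adjust this to the actual export format.
KEY_FIELDS = ("radio", "mcc", "net", "area", "cell")


def read_rows(path):
    """Yield the rows of one gzipped CSV export as dicts (header line assumed)."""
    with gzip.open(path, mode="rt", newline="", encoding="utf-8") as handle:
        yield from csv.DictReader(handle)


def apply_exports(full_rows, hourly_diffs, key_fields=KEY_FIELDS):
    """Merge a full export with hourly diffs given in chronological order.

    The files only add or replace rows, so a later row with the same key
    simply overwrites the earlier one.
    """
    cells = {}
    for rows in [full_rows, *hourly_diffs]:
        for row in rows:
            key = tuple(row[field] for field in key_fields)
            cells[key] = row
    return cells
```

To use it, pass `read_rows(...)` for the full export and a list of `read_rows(...)` iterators for the diff files, oldest first. The order matters: since later rows win, applying the diffs out of sequence could leave a stale row in place.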
