Hi Adrian,

thank you for taking the time to do a detailed review of the format. It
seems I have not communicated the use-case and scope of this format
adequately.

There are three main use-cases for the data format today:

1. Allow a data exchange between OpenCellID and MLS
2. Allow anyone to download the data and run their own instance of ichnaea
   with the same data we have in MLS
3. Allow community members to help validate, visualize and play with the
   current data

As the first use-case was the driving force behind this, we took the
existing OpenCellID export format and adjusted it in minor ways to satisfy
both of our projects’ needs. This included the addition of the range and
unit (psc/pci) fields and a shorter way to state the unique logical
identifier for each cell network (consisting of five radio type specific
fields). We kept the field naming as close as possible to the old format,
but generally the field names weren’t important to us. With a CSV file the
field position determines the field value, and we could have named the
fields field1, field2, etc. or gone without a header row.

In general this format is not at all trying to be an official internet
standard, but merely a first step at a practical collaboration between
OpenCellID and MLS, which, if it proves successful, might get others
interested in the data and expose new use-cases. At that point we can
refine the format or come up with something better.

I have tried to take some of your more concrete and actionable feedback and
addressed it with documentation improvements in
https://github.com/mozilla/ichnaea/commit/904731187427ba6595231855d3d693e9c9d4205d.

What we intentionally didn’t specify with the format was the algorithms,
filtering and validation rules that lead to the position, range and average
signal estimates. Currently both OpenCellID and MLS use similar algorithms
that differ in the details. OpenCellID also has a huge amount of older data
that has not been filtered much, which is something they are planning to
fix later this year.
They’ll reset their aggregate data and rebuild it based on the underlying
observation data and newer filtering and validation rules.

Today you get basically a slightly more standardized SQL dump of the data
we have in MLS. It’s not more but also not less than that. I consider the
questions on position estimations and algorithms outside the scope of the
data exchange format. Currently you need to remember the source of the data
file, which will determine the algorithms in use. For MLS those can be
looked up from the ichnaea source code. I’m planning to add more
documentation about the details of this later.

Hanno

On 26.08.2014, at 20:09 , Adrian Custer <[email protected]> wrote:

> Hello all,
>
> Hanno's email was timely since I was about to send a mail asking about this
> API effort. I do have "Feedback and Concerns"; here are some.
>
>
> At the process level, a week before launch is *very* late to be asking for
> feedback on a public API! The GeoLocation Web API and the Mozilla Location
> Service upload API both would have benefited from some good, structured logic
> to avoid their bad structure and naming. I guess the discussion on this API
> was all happening on GitHub issues and not on this list. For something 'a
> long time in the planning' though, this public notice is sadly late.
>
>
> The Internet end of line separator is [carriage-return, line feed] as per all
> IETF standards.
>     The Text/Plain media type is the lowest common denominator of
>     Internet email, with lines of no more than 998 characters (by
>     convention usually no more than 78), and where the carriage-return
>     and line-feed (CRLF) sequence represents a line break (see [MIME-IMT]
>     and [MSG-FMT]).
>     http://tools.ietf.org/html/rfc3676
> This started with email, was kept for HTTP, and, absent strong reasons to
> change it, sticking with the standard is best policy.
>
>
> The proposed API needs work: the semantics are a mishmash and the naming is
> terrible.
> The page name 'import-export' is in direct conflict with the API
> structure which appears to only have been thought out for export. Here I only
> discuss export because developing an API that can serve both will take way
> more time and we might as well start somewhere. Nonetheless, the clearer the
> name and documentation, the more reusable the element.
>
>
> Semantically, the API is offering a set of individual data records each of
> which consists of:
>     * a set of labels which jointly identify an individual antenna
>     * known properties of the antenna
>     * measurements of that antenna
>     * estimated properties of that antenna
>     * record metadata
> Unfortunately, neither the names nor the documentation properly separate out
> these roles.
>
>
> Let's walk the proposal:
>
>
>     'mcc'   okay but -> 'mobCountryId' to match others
>     'net'   bad name -> 'mobProviderId'
>     'area'  bad name -> 'mobAreaId'
>     'cell'  bad name -> 'mobCellId'
>     'unit'  bad name -> 'mobSubUnitId'
>
> Come on! 'net', 'area', 'cell', 'unit' have generic meaning in the world that
> has nothing to do with your API. Put in a little more effort to your naming,
> please! Save your users some headaches.
>
> All of these seem, from what I can tell, to be code identifiers which,
> JOINTLY, label the specific radio antenna which is the subject of the data
> record. Semantically we really have
>     'antenneId' : 'mcc'&'net'&'area'&'cell'&'unit'
> but here we use five fields instead of one. Fine, but this needs clear
> documentation. In a JSON API these would properly be jointly in a
> sub-structure but since this is a flat API we just need clarity in the
> documentation stating that jointly these identifiers will provide a unique
> label for each record.
>
> Ideally, these names would all have form 'id...' but English places its
> adjectives first 'red car' (versus 'voiture rouge' in French or 'coche rojo'
> in Spanish) so we end up with a structure '...Id'.
> My proposed shared prefix
> 'mob' helps clarify these fields are all similar and work jointly.
>
>
>     'radio' bad name -> 'radioClass' or some such
>
> The current MozLocService item upload has a similar crappy naming approach
> where each item has a 'radio' element but then each observed cell in the item
> also has its own 'radio' element, of course with different data. So I have to
> have this ridiculous lookup object:
>     var CELL_TYPE_LOOKUP = {
>       'type':  ['cell.radio', 'item.radio'], //Header field
>
>       'gsm':   ['gsm',  'gsm'],  //1G GSM
>       'edge':  ['gsm',  'gsm'],  //2G EDGE
>       'gprs':  ['gsm',  'gsm'],  //2G GPRS
>       'umts':  ['umts', 'gsm'],  //3G UMTS
>       'hspa':  ['umts', 'gsm'],  //3.5G HSPA
>       'hsdpa': ['umts', 'gsm'],  //3.5G HSDPA
>       'hspa+': ['umts', 'gsm'],  //3.5G HSPA+
>       'hsupa': ['umts', 'gsm'],  //3.5G HSUPA
>
>       'cdma':  ['cdma', 'cdma'], //1G CDMA
>       'is95a': ['cdma', 'cdma'], //2G CDMA
>       'is95b': ['cdma', 'cdma'], //2G CDMA
>       '1xrtt': ['cdma', 'cdma'], //2G CDMA
>       'evdo0': ['cdma', 'cdma'], //3G CDMA
>       'evdoa': ['cdma', 'cdma'], //3G CDMA
>       'evdob': ['cdma', 'cdma'], //3G CDMA
>       'ehrpd': ['cdma', 'cdma'], //4G CDMA
>
>       'lte':   ['lte',  'gsm']   //4G LTE
>     }
> to generate what is required. I take it this proposed API element is the
> middle column. I have taken to naming the first column 'radioType,' the
> second 'radioClass', and the third 'radioFamily' but these names are
> arbitrary. First you need to decide on your name and then you need a bunch
> more documentation providing essentially this lookup table to explain this to
> users.
>
>
>     'lon'
>     'lat'
>
> The documentation should mention the Coordinate Reference System for these as
> being the CRS used by the GPS system, i.e. WGS84. "The prime meridian is 0
> degrees" is a tautology---that's what 'prime meridian' means. More properly,
> this could be "The Prime Meridian (with value 0 degrees) is the IERS
> Reference Meridian, close to, but not the same as, the Greenwich Airy
> Meridian."
>     https://en.wikipedia.org/wiki/World_Geodetic_System
>     http://spatialreference.org/ref/epsg/4326/
>
>
>     'changeable' terrible name
>
> As far as I can tell, this only applies to the location of the antenna so
> the name needs to be linked to the position. From the consumer standpoint,
> the only thing interesting is how the position has been 'determined': either
> defined or estimated, and if the latter probably the user wants some notion
> of how it was estimated. This could be done in a single field or in two,
> depending on what you want
>
>     -> 'posEstimationMethod' DEFINED || CENTROID || ALGO_6
> or
>     -> 'posDetermination'  DEFINED || ESTIMATED
>     -> 'posEstimationAlgo' MEASURED || CENTROID || ...
>
>
> The best way, given the variety of algorithms possible, would be to define a
> few and then use an HTTP URI (i.e. a URL) for the rest where the link is to
> a web page with the description of the estimation algorithm or process.
> Otherwise the documentation needs some indication of how the position
> estimates were derived.
>
> Are you punting completely on giving any estimate of the accuracy of the
> position? I would expect a
>
>     'posAccuracy'
>
> giving a 95% CI radius around the observation since that is the crucial
> factor which makes the position usable or not. (The only other element of
> your API that would let me guess as to the quality of the data would be the
> number of observations but this does not let me know if they were all in a
> line or were well distributed spatially.) Since the service, which has all
> the data, is the only one who can properly make this estimate, it seems this
> should be generated for each record.
>
>
>     'range' bad name -> 'rangeEstimate'
>
> Conceptually, this is an estimate of the distance at which the signal level
> drops below some particular strength, perhaps usable strength. So the
> documentation should explain that.
> Of course, for different radio
> technologies the threshold strength is probably different, so what is this
> really? Is this a property of the radioClass or is this an estimate based on
> the observations?
>
>
>     'samples' bad name -> 'obsNumber' or 'numObs' or 'numSamples'
>
> The name 'samples' suggests it is the samples themselves but it is actually a
> number. The text says it is the number of observations used to determine the
> position but we have already seen the position might have been defined. So
> the documentation needs to be clear what other entries are based on these
> observations: i.e. the 'range' or 'averageSignal'.
>
>
>     'averageSignal' -> !?
>
> Ouch. Hmm. What is this telling us about? Is this to help us estimate the
> quality of the observations or to help us estimate the quality of the
> position estimate? 'Max', 'Median', and 'Min' might help with the former;
> some kind of referent of 'MaxEverForRadioClass' and 'MinEverForRadioClass' in
> the documentation would be needed for the latter. A straight mathematical
> average for a 2D spatial estimate is crazy problematic to interpret a
> posteriori so I am really not sure what this is supposed to provide users.
> Some clarity of the usage of this number and its behaviour in the field is
> needed in the documentation.
>
>
>     'created'
>     'updated'
>
> Are these purely database modification times or are these related to the
> observations? If the latter, 'firstObserved' and 'lastObserved' would be
> better names.
>
> Why make it an ambiguous timestamp, when you can make it an unambiguous ISO
> 8601 Date (e.g. 2014-07-24T12:16:36Z)?
>
>
> This is not the API I would have expected.
>
> Without one or a few ways to estimate the accuracy of the position, these
> records are of little use for positioning.
> Without a richer description of
> the spatial structure of the observations, like bounding boxes or partial
> bounding boxes, these records are of little use in defining the quality of
> the overall database. So we are left with being able to get summary records
> which neither provide a well-defined estimate of position and other values
> nor provide a rich summary of the data. As it stands, this API encourages
> direct, uncritical use of the positions; since OpenCellId estimates several
> antennae as being in the middle of the ocean, this is not great.
>
> Have you developed a set of usage examples for this API? Are those written up
> somewhere? What is the goal of such usages? I have a difficult time guessing
> as to the motivations which led to such an API.
>
> ~adrian
> _______________________________________________
> dev-geolocation mailing list
> [email protected]
> https://lists.mozilla.org/listinfo/dev-geolocation
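
For readers following the thread, two points made above (that the field
position, not the header name, decides what a CSV value means, and that a
raw timestamp can be rendered as an unambiguous ISO 8601 date) can be
sketched in a few lines of Python. This is only an illustration under
assumptions: the column order in COLUMNS, the sample row, and the reading of
'created'/'updated' as Unix seconds in UTC are hypothetical and are not
taken from the format documentation.

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical column order, for illustration only; consult the format
# documentation for the real one.
COLUMNS = ["radio", "mcc", "net", "area", "cell", "unit", "lon", "lat",
           "range", "samples", "changeable", "created", "updated",
           "averageSignal"]

# Made-up sample row, not real data. Empty fields stay empty strings.
SAMPLE = ("GSM,262,2,801,86355,,13.285512,52.522202,"
          "1000,7,1,1406204196,1406204196,\n")


def parse_rows(text, has_header=False):
    """Yield one dict per CSV row, mapping values to names by position."""
    reader = csv.reader(io.StringIO(text))
    if has_header:
        next(reader, None)  # header names are ignored; position decides
    for row in reader:
        yield dict(zip(COLUMNS, row))


def to_iso8601(unix_seconds):
    """Render a timestamp, assumed to be Unix seconds in UTC, as ISO 8601."""
    moment = datetime.fromtimestamp(int(unix_seconds), tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M:%SZ")


record = next(parse_rows(SAMPLE))
print(record["mcc"], record["cell"], to_iso8601(record["created"]))
```

Because the mapping is positional, a file with renamed header fields (or
with no header row at all) parses the same way, which is the point made in
the reply; the cost is that any reordering of columns silently changes the
meaning of every value.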
