Re: [gdal-dev] Simple schema support for GeoJSON

Andreas Oxenstierna Fri, 21 Nov 2014 07:25:42 -0800

Hi

The usual reason to choose GeoJSON for geoweb applications is that JSON is parsed directly by the web browser, i.e. you get JavaScript objects directly digestible by your JavaScript code. This may also be considerably faster than parsing XML.

Bandwidth is more or less irrelevant in comparison.

On Friday 21 November 2014 15:35:43, Rahkonen Jukka (Tike) wrote:

Hi,

I have no use for this feature myself but by reading various mailing lists
and forums I have learned that many people consider it always a good
idea to read data, for example from WFS services, as GeoJSON instead of GML.

Because it consumes less bandwidth?

For the record, if you try the following, it will use the GML schema for the
user-exposed layer and will do an on-the-fly transform from the hidden GeoJSON
layer schema to the GML schema, similar to the one you could do with a
CAST/VRT.

$ ogrinfo "WFS:http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=getfeature&typename=topp:states&outputformat=json" -ro -al -where "STATE_NAME = 'California'"

Layer name: topp:states
Geometry: Multi Polygon
Feature Count: 1
Extent: (-124.391472, 32.535725) - (-114.124451, 42.002346)
Layer SRS WKT:
GEOGCS["WGS 84",
     DATUM["WGS_1984",
         SPHEROID["WGS 84",6378137,298.257223563,
             AUTHORITY["EPSG","7030"]],
         AUTHORITY["EPSG","6326"]],
     PRIMEM["Greenwich",0,
         AUTHORITY["EPSG","8901"]],
     UNIT["degree",0.0174532925199433,
         AUTHORITY["EPSG","9122"]],
     AUTHORITY["EPSG","4326"]]
gml_id: String (0.0)
STATE_NAME: String (0.0)
STATE_FIPS: String (0.0)
SUB_REGION: String (0.0)
STATE_ABBR: String (0.0)
LAND_KM: Real (0.0)
WATER_KM: Real (0.0)
PERSONS: Real (0.0)
FAMILIES: Real (0.0)
HOUSHOLD: Real (0.0)
MALE: Real (0.0)
FEMALE: Real (0.0)
WORKERS: Real (0.0)
DRVALONE: Real (0.0)
CARPOOL: Real (0.0)
PUBTRANS: Real (0.0)
EMPLOYED: Real (0.0)
UNEMPLOY: Real (0.0)
SERVICE: Real (0.0)
MANUAL: Real (0.0)
P_MALE: Real (0.0)
P_FEMALE: Real (0.0)
SAMP_POP: Real (0.0)
OGRFeature(topp:states):0
   gml_id (String) = (null)
   STATE_NAME (String) = California
   STATE_FIPS (String) = 06
   SUB_REGION (String) = Pacific
   STATE_ABBR (String) = CA
   LAND_KM (Real) = 403970.143
   WATER_KM (Real) = 20023.368
   PERSONS (Real) = 29760021
   FAMILIES (Real) = 7139394
   HOUSHOLD (Real) = 10381206
   MALE (Real) = 14897627
   FEMALE (Real) = 14862394
   WORKERS (Real) = 11306576
   DRVALONE (Real) = 9982242
   CARPOOL (Real) = 2036025
   PUBTRANS (Real) = 685797
   EMPLOYED (Real) = 13996309
   UNEMPLOY (Real) = 996502
   SERVICE (Real) = 3664771
   MANUAL (Real) = 1798201
   P_MALE (Real) = 0.501
   P_FEMALE (Real) = 0.499
   SAMP_POP (Real) = 3792553
   MULTIPOLYGON (((....)))

I can easily imagine that the guess-by-data method will cause trouble
if they make subsequent requests to the service. For example,
strings that consist only of digits but may contain leading zeroes could
end up as either integers or strings, if the leading zeroes are interpreted
correctly at all.

In JSON, "00123" and 00123 are different objects. So a string with leading
zeros should be serialized as "00123" and not 00123. If it is serialized as
"00123", the GeoJSON driver will interpret it as a string.
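
To illustrate the distinction with a standard JSON parser, here is a minimal
Python sketch (this is not GDAL code; the property name is simply borrowed from
the example above). A quoted value keeps its leading zero as a string, whereas
a bare number with a leading zero is not valid JSON, so Python's json module
rejects it. Other parsers may be more lenient.

```python
import json

# A quoted value is a JSON string: the leading zero survives.
quoted = json.loads('{"STATE_FIPS": "06"}')


def parses(text):
    """Return True if text is accepted by Python's strict JSON parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(type(quoted["STATE_FIPS"]).__name__, quoted["STATE_FIPS"])
# A bare number with a leading zero is rejected rather than guessed at.
print(parses('{"STATE_FIPS": 06}'))
```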

Or floats which do not always contain decimals, or list
attributes which sometimes have only zero or one member.

Yes, those cases could cause issues.
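
A small Python sketch of how this can happen, assuming two hypothetical
responses from the same service (the values are invented for illustration).
A float serialized without decimals is indistinguishable from an integer once
parsed, so a guess made per response can differ between requests.

```python
import json

# Two hypothetical pages of the same attribute from one service.
page1 = json.loads('[{"LAND_KM": 1045.5}, {"LAND_KM": 12.0}]')
page2 = json.loads('[{"LAND_KM": 5}, {"LAND_KM": 12}]')


def all_integers(features, name):
    """True when every value of the attribute parsed as an integer."""
    return all(isinstance(f[name], int) for f in features)


print(all_integers(page1, "LAND_KM"))  # this page would be guessed as Real
print(all_integers(page2, "LAND_KM"))  # this page would be guessed as Integer
```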

Embedded schema feels optimal because then it would always travel together
with the data and we all have probably lost .tfw or .prj files sometimes.

-Jukka-

Even Rouault wrote:

Jukka,

Data type guessing implemented in the OGR GeoJSON driver is quite natural
hopefully.
A whole scan of the GeoJSON file is made and the following rules are
applied:
- if an attribute has integer-only content --> Integer
- if an attribute has an array of integer-only content --> IntegerList
- if an attribute has integer or floating point content --> Real
- if an attribute has an array of integer or floating point content --> RealList
- if an attribute has an array of anything else content --> StringList
- otherwise --> String
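
The rules above can be sketched in a few lines of Python. This is only an
illustration of the listed rules, not the driver's actual code: the function
name guess_field_type is made up, and skipping nulls and treating booleans as
non-integers are assumptions of this sketch.

```python
def guess_field_type(values):
    """Guess an OGR-like type name from all values of one attribute."""

    def is_int(v):
        # bool is a subclass of int in Python; exclude it (assumption).
        return isinstance(v, int) and not isinstance(v, bool)

    def is_num(v):
        return is_int(v) or isinstance(v, float)

    # Nulls are ignored when guessing (assumption of this sketch).
    values = [v for v in values if v is not None]
    if not values:
        return "String"
    if all(isinstance(v, list) for v in values):
        items = [item for v in values for item in v]
        if all(is_int(i) for i in items):
            return "IntegerList"
        if all(is_num(i) for i in items):
            return "RealList"
        return "StringList"
    if all(is_int(v) for v in values):
        return "Integer"
    if all(is_num(v) for v in values):
        return "Real"
    return "String"


print(guess_field_type([7139394, 12]))       # integer-only content
print(guess_field_type([403970.143, 5]))     # integer or floating point
print(guess_field_type(["06", "12"]))        # quoted digits stay strings
```

Because the guess depends on every value seen, the same attribute can come out
as Integer for one file and Real for another.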

With RFC 50 and other pending improvements in the driver:
- if an attribute has boolean-only content --> Integer(Boolean)
- if an attribute has an array of boolean-only content --> IntegerList(Boolean)
- if an attribute has date-only content --> Date
- if an attribute has time-only content --> Time
- if an attribute has datetime or date content --> DateTime

I'm not sure we want to invent a .jsont format, but if you download
http://svn.osgeo.org/gdal/trunk/gdal/swig/python/samples/ogr2vrt.py

and run:

python ogr2vrt.py "http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=getfeature&typename=topp:states&outputformat=json" test.vrt

This will create a VRT with the default schema, which you can easily
edit. Note: as with OGR SQL CAST, this is post-processing. So if the
guess done by the GeoJSON driver leads to a loss of information, you
cannot recover it. Hopefully the implemented rules will not lead to
information loss.

A better approach would be to have the schema embedded in a JSON way in
the GeoJSON file itself.
That could be an evolution of the format, but I'm not sure this would be
really popular, given JSON/GeoJSON is heavily used by NoSQL
approaches...

Hmm, doing a quick search, I just found http://json-schema.org/ which
appears to be an IETF draft.
It doesn't look like the schema is embedded in the data file itself.

There's also GeoJSON-LD that might be a bit related :
https://github.com/geojson/geojson-ld

CC'ing Sean in case he has thoughts on this.

Even

Hi,

I wonder if GDAL could have some simple and relatively user-friendly
way of defining a schema for GeoJSON data. The GeoJSON driver seems
to guess the data types of attributes in some undocumented way but
users could have better knowledge about the desired schema.

I know I can control the data type by using OGR SQL and CAST as in
ogrinfo -sql "select cast(EMPLOYED as float) from OGRGeojson" states.json -so

However, perhaps GeoJSON is popular enough to deserve an easier way
of writing a schema. First I thought that it would be enough to copy
the "csvt" text file mechanism from the GDAL CSV driver
http://www.gdal.org/drv_csv.html. However, the csvt file is a plain
list of types which will be applied to the attributes in the same
order as they appear in the text file
"Integer(5)","Real(10.7)","String(15)"

For GeoJSON it would feel more user friendly to include the attribute
names in the list somehow like
"population;Integer(5)","area;Real(10.7)","name;String(15)".

This would make it easier for users to write a valid "jsont" file. A
list with attribute names could perhaps help GDAL as well because
the features in a GeoJSON file do not necessarily have the same attributes.
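
As a rough sketch of how such a list could be read, here is some Python. The
"jsont" format is only the proposal made in this message, not something GDAL
implements, and the function name parse_jsont and the returned tuple layout
are made up for the illustration.

```python
import csv
import io
import re

# Matches "Type", "Type(width)" or "Type(width.precision)".
TYPE_RE = re.compile(r"^(\w+)(?:\((\d+)(?:\.(\d+))?\))?$")


def parse_jsont(text):
    """Parse one line of the hypothetical 'name;Type(width.precision)' list."""
    row = next(csv.reader(io.StringIO(text)))
    schema = {}
    for entry in row:
        name, _, type_spec = entry.partition(";")
        match = TYPE_RE.match(type_spec.strip())
        if not name or match is None:
            raise ValueError(f"bad jsont entry: {entry!r}")
        type_name, width, precision = match.groups()
        schema[name.strip()] = (
            type_name,
            int(width) if width else None,
            int(precision) if precision else None,
        )
    return schema


schema = parse_jsont(
    '"population;Integer(5)","area;Real(10.7)","name;String(15)"'
)
print(schema)
```

Keying the result by attribute name means the order of entries no longer
matters, which is the main difference from the positional csvt list.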

As an example this is the right schema for a WFS feature type which is
captured from
http://demo.opengeo.org/geoserver/wfs?service=wfs&version=1.0.0&request=describefeaturetype&typename=topp:states


name="the_geom" type="gml:MultiPolygonPropertyType"/>
name="STATE_NAME" type="xsd:string"/>
name="STATE_FIPS" type="xsd:string"/>
name="SUB_REGION" type="xsd:string"/>
name="STATE_ABBR" type="xsd:string"/>
name="LAND_KM" type="xsd:double"/>
name="WATER_KM" type="xsd:double"/>
name="PERSONS" type="xsd:double"/>
name="FAMILIES" type="xsd:double"/>
name="HOUSHOLD" type="xsd:double"/>
name="MALE" type="xsd:double"/>
name="FEMALE" type="xsd:double"/>
name="WORKERS" type="xsd:double"/>
name="DRVALONE" type="xsd:double"/>
name="CARPOOL" type="xsd:double"/>
name="PUBTRANS" type="xsd:double"/>
name="EMPLOYED" type="xsd:double"/>
name="UNEMPLOY" type="xsd:double"/>
name="SERVICE" type="xsd:double"/>
name="MANUAL" type="xsd:double"/>
name="P_MALE" type="xsd:double"/>
name="P_FEMALE" type="xsd:double"/>
name="SAMP_POP" type="xsd:double"/>


This is what GDAL is guessing:
STATE_NAME: String (0.0)
STATE_FIPS: String (0.0)
SUB_REGION: String (0.0)
STATE_ABBR: String (0.0)
LAND_KM: Real (0.0)
WATER_KM: Real (0.0)
PERSONS: Real (0.0)
FAMILIES: Integer (0.0)
HOUSHOLD: Real (0.0)
MALE: Real (0.0)
FEMALE: Real (0.0)
WORKERS: Real (0.0)
DRVALONE: Integer (0.0)
CARPOOL: Integer (0.0)
PUBTRANS: Integer (0.0)
EMPLOYED: Real (0.0)
UNEMPLOY: Integer (0.0)
SERVICE: Integer (0.0)
MANUAL: Integer (0.0)
P_MALE: Real (0.0)
P_FEMALE: Real (0.0)
SAMP_POP: Integer (0.0)
bbox: RealList (0.0)
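
To make the difference between the two listings easier to see, here is a
small Python comparison. The field names and types are copied from the two
listings above; mapping xsd:string to String and xsd:double to Real is an
assumption of this sketch, and the geometry property is left out.

```python
# Declared types from DescribeFeatureType
# (assumed mapping: xsd:string -> String, xsd:double -> Real).
strings = ["STATE_NAME", "STATE_FIPS", "SUB_REGION", "STATE_ABBR"]
doubles = ["LAND_KM", "WATER_KM", "PERSONS", "FAMILIES", "HOUSHOLD", "MALE",
           "FEMALE", "WORKERS", "DRVALONE", "CARPOOL", "PUBTRANS", "EMPLOYED",
           "UNEMPLOY", "SERVICE", "MANUAL", "P_MALE", "P_FEMALE", "SAMP_POP"]
declared = {**{n: "String" for n in strings}, **{n: "Real" for n in doubles}}

# Fields the GeoJSON driver guessed as Integer in the listing above.
guessed_integer = {"FAMILIES", "DRVALONE", "CARPOOL", "PUBTRANS",
                   "UNEMPLOY", "SERVICE", "MANUAL", "SAMP_POP"}
guessed = {n: ("Integer" if n in guessed_integer else t)
           for n, t in declared.items()}
guessed["bbox"] = "RealList"  # present in the guess, absent from the schema

mismatches = sorted(n for n in declared if guessed[n] != declared[n])
extra = sorted(set(guessed) - set(declared))
print("guessed type differs:", mismatches)
print("only in guessed schema:", extra)
```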

-Jukka Rahkonen-

_______________________________________________
gdal-dev mailing list
[email protected]
http://lists.osgeo.org/mailman/listinfo/gdal-dev

--
Spatialys - Geospatial professional services http://www.spatialys.com

--
Regards

Andreas Oxenstierna
T-Kartan Produkt AB
mobile: +46 733 206831
mailto: [email protected]
http://www.t-kartor.com
