Re: Minimizing data volume

Frans Knibbe | Geodan Tue, 10 Sep 2013 12:12:47 -0700

On 10-9-2013 13:33, Andy Turner wrote:

Hi,
At least these two OGC standards might be worth having a look at in this context:
http://www.opengeospatial.org/standards/geosparql

http://www.opengeospatial.org/standards/tjs
The latter is the Georeferenced Table Joining Service Implementation Standard. A lot of thought went into different kinds of linking of geographical data during its development. Sorry, but I know very little about the GeoSPARQL standard.
The notion of keeping geometry data separate and providing metadata about geometries in standard forms is useful. For vector data, the number of points in the geometry is one of the key attributes an application might consider before pulling that geometry. (The size of its representation in bytes, both compressed and uncompressed, is useful info too. Thanks, Leigh.)
So, for vector data, the attributes for individual vectors (almost like features) can be kept separate from the spatial geometries, and some linkage code can be used to join the data together. Yes, there are advantages in terms of storage organisation for keeping attributes and geometries separate, but for many applications some attributes of the geometries are also wanted, so this geometrical metadata is important to think about. Some of it can be computationally hard to calculate, so once calculated it is perhaps worth storing as optional metadata.
Individual points with a single attribute, where the point is defined with respect to axes in some geographical coordinate and projection system, are simple geo-vectors. Lines built from multiple such points (and equations) are more detailed/complex, yet these can have simply attributed generalised point representations (the location of the smallest circle/sphere encompassing all the points in the line, perhaps with a measure of its radius). There are similar things for regional polygons in two and three dimensions.
The geometries of lines and points can also be simplified in other ways, which can result in other lines and polygons. Simplifying contiguous polygons while maintaining topological relationships is not necessarily straightforward.
The point I am trying to make with the above is that there are multiple different geometries, not a single geometry, for a real world object that can be described/defined with RDF. Some of the more generalised forms of the spatial geometries can be calculated and stored as metadata in table representations with a fixed number of fields. Often so-called bounding boxes and bounding circles are used, as are line lengths, perimeters, surface areas, volumes, average distances, and ratios of these geometrical attributes. Based on the geometrical attributes, further attributes can be derived for other attributes (e.g. density).
Consider something complex, like a city. This has multiple geometrical representations.
Two more things:
Geohashes (http://en.wikipedia.org/wiki/Geohash), which interleave coordinates represented by positions on axes using some predetermined axis order and prescription, are useful in the context of linking data, as they are string representations: the more truncated they are, the less precision they provide for the location of a point, but they still start with the same string sequence.
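The prefix behaviour of geohashes can be illustrated with a minimal encoder in Python. This is only a sketch of the publicly documented Geohash algorithm; the function name geohash_encode and the sample coordinates are chosen here for illustration and do not come from any particular library.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, length=11):
    """Encode a latitude/longitude pair as a geohash of `length` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0         # bits collected for the current character
    bit_count = 0
    use_lon = True   # bits alternate between axes, starting with longitude
    while len(chars) < length:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid   # keep the upper half of the interval
        else:
            bits <<= 1
            rng[1] = mid   # keep the lower half of the interval
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)

full = geohash_encode(57.64911, 10.40744, 11)
short = geohash_encode(57.64911, 10.40744, 5)
print(full, short)
```

The shorter hash is a prefix of the longer one, so two resources whose geohashes share a long common prefix are near each other, although nearby points on opposite sides of a cell boundary may share only a short prefix.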
The other key dimension to think about in geographical relations is time. How time relates to all this is important, but this email is already long, so all I will state is that a city now could be very different to a city some years ago (in terms of spatial dimension/geometry), yet in some ways they are the same place. There are ways to derive (very) complex geometries of ephemeral events; you could consider one, like the Olympic Games.
HTH and sorry for the long post.

Hello Andy,

Thank you for the long post and for sharing your thoughts.

Yes, I agree that any real world object can have many different geometries, depending on coordinate reference system, level of detail, time, method of measurement and whatnot. But I don't think that is a problem. Linked Data is very capable of sharing different perspectives of a single real world phenomenon, and also of annotating those different perspectives to help with correctly interpreting them.

The problem that I see is how to handle those cases where geometry literals become unwieldy. The GeoSPARQL specification that you mention provides a way of writing a geometry as a literal in RDF. There may be several approaches as to how to serialize a geometry, but ending up with a series of coordinates is inescapable. And I am worried about the impact of these series of coordinates becoming very long. That is why I also do like the idea of providing some extra data to enable a client to distinguish between large and small geometries. The small ones could be downloaded and processed right away, but the bigger ones might need some extra care.
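As a rough illustration of what such extra data could look like, here is a small Python sketch that derives a point count, a bounding box and the serialized size (raw and gzip-compressed) from a polygon ring. The function name summarize_polygon, the dictionary keys and the sample coordinates are assumptions made for this example; no vocabulary for publishing these values is implied.

```python
import gzip

def summarize_polygon(ring):
    """Summarize a polygon ring given as a list of (x, y) tuples.

    Hypothetical helper: returns values a publisher might expose
    alongside a geometry so clients can judge its size beforehand.
    """
    xs = [x for x, _ in ring]
    ys = [y for _, y in ring]
    # Serialize as WKT, the text form used for GeoSPARQL geometry literals.
    wkt = "POLYGON((" + ", ".join(f"{x} {y}" for x, y in ring) + "))"
    raw = wkt.encode("utf-8")
    return {
        "point_count": len(ring),
        "bbox": (min(xs), min(ys), max(xs), max(ys)),
        "bytes_raw": len(raw),
        "bytes_gzip": len(gzip.compress(raw)),
    }

# Made-up rectangle, closed by repeating the first point.
ring = [(4.88, 52.36), (4.92, 52.36), (4.92, 52.39), (4.88, 52.39), (4.88, 52.36)]
summary = summarize_polygon(ring)
print(summary)
```

For a geometry this small the gzip figure may well exceed the raw size; the numbers only become interesting for rings with thousands of points.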

Thinking about this, I wonder if the idea of a general compression function for literals has ever been considered for SPARQL. That would enable a query like

SELECT ?name, ?population, COMPRESS(?geometry) FROM <http://example.org/cities>

Such a function could be used only for those literals whose size exceeds a certain threshold. And it would be applicable to all kinds of big literals.
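To make the threshold idea concrete, here is a minimal Python sketch of what such a function might do on the server side, with a matching decompression step for the client. COMPRESS is only the hypothetical function from the query above; the names maybe_compress, decompress and THRESHOLD_BYTES, the 1024-byte cut-off, and the choice of zlib plus Base64 are all assumptions made for illustration.

```python
import base64
import zlib

THRESHOLD_BYTES = 1024  # hypothetical cut-off; a real value would need tuning

def maybe_compress(literal, threshold=THRESHOLD_BYTES):
    """Return (value, was_compressed) for a string literal.

    Literals at or below the threshold are passed through unchanged.
    Larger ones are deflated and Base64-encoded so they remain text.
    """
    raw = literal.encode("utf-8")
    if len(raw) <= threshold:
        return literal, False
    packed = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return packed, True

def decompress(packed):
    """Client-side inverse of the compressed branch of maybe_compress."""
    return zlib.decompress(base64.b64decode(packed)).decode("utf-8")

small = "POINT(4.9 52.37)"
large = "LINESTRING(" + ", ".join(
    f"{4 + i / 1000} {52 + i / 1000}" for i in range(2000)
) + ")"

for literal in (small, large):
    value, compressed = maybe_compress(literal)
    print(len(literal), len(value), compressed)
```

A client would need some way to tell which results were compressed, for instance a dedicated datatype on the returned literal; the boolean flag here merely stands in for that.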

About Geohash: Yes, it is a kind of compression for geometry, but as far as I can tell it only applies to single points.


Regards,
Frans

Andy
http://www.geog.leeds.ac.uk/people/a.turner/

*From:*Frans Knibbe | Geodan [mailto:[email protected]]
*Sent:* 10 September 2013 11:11
*To:* Leigh Dodds
*Cc:* public-lod community
*Subject:* Re: Minimizing data volume

On 9-9-2013 16:48, Leigh Dodds wrote:

    Hi,
    Before using compression you might also make a decision about whether you need to represent all of this information as RDF in the first place.
    For example, rather than include the large geometries as literals, why not store them as separate documents and let clients fetch the geometries when needed, rather than as part of a SPARQL query?
    Geometries can be served using standard HTTP compression techniques and will benefit from caching.
    You can provide summary statistics (including size of the document, and properties of the described area, e.g. centroids) in the RDF to help address a few common requirements, allowing clients to only fetch the geometries they need, as they need them.
    This can greatly reduce the volume of data you have to store and provides clients with more flexibility.
    Cheers,
    L.
Yes, that is something to consider. Thanks for broadening my mind! I think such an approach may be suited for certain kinds of high volume data, like images or video. But I do have some doubts about its effectiveness for geographical data:
1) In geographical data sets geometries typically have different sizes. Some may be very big, others may be reasonably small. So where to draw the limit?
2) When using SPARQL and RDF it is already possible to provide summary statistics and leave it to the client to fetch the geometries if needed. However, it is not standard practice to provide summaries like centroid, bounding box or coordinate count for each geometry, but perhaps it should be.
3) On the surface, this approach seems to add complexity to data retrieval, for both clients and servers. Instead of one way of publishing and getting data, there will be two ways.
4) Having to fetch geometries one at a time, instead of processing them all from one data set, could complicate matters and also introduce some loss of performance. I can imagine this method working well for things like images, videos or files, because they are typically used one at a time. But in many cases geometries should be available all at once, to draw on a map for instance.
5) I think most geometries are stored as attribute data in relational databases. Preprocessing them to make them available as separate files can be done offline. But in other cases the geometries are transient; they could be generated by a function in a query. The method should work with performance gains in those cases too.
Regards,
Frans
    On Mon, Sep 9, 2013 at 10:47 AM, Frans Knibbe | Geodan <[email protected]> wrote:

        Hello,
        In my line of work (geographical information) I often deal with high volume data. The high volume is caused by single facts having a big size. A single 2D or 3D geometry is often encoded as a single text string and can consist of thousands of numbers (coordinates). It is easy to see that this can cause performance issues with transferring and processing data. So I wonder about the state of the art in minimizing data volume in Linked Data. I know that careful publication of data will help a bit: multiple levels of detail could be published, coordinates could use significant digits (they almost never do), but it seems to me that some kind of compression is needed too. Is there something like a common approach to data compression at the moment? Something that is understood by both publishers and consumers of data?
        Regards,
        Frans
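The remark about significant digits can be made tangible with a small Python sketch that rounds coordinates before serializing them. The helper names round_coords and to_wkt_linestring, the sample coordinates and the choice of five decimals are assumptions for illustration; in degrees, five decimals corresponds to roughly one metre at the equator, which may or may not match the accuracy of a given data set.

```python
def round_coords(coords, decimals=5):
    """Round every (x, y) pair to a fixed number of decimals."""
    return [(round(x, decimals), round(y, decimals)) for x, y in coords]

def to_wkt_linestring(coords):
    """Serialize a list of (x, y) pairs as a WKT LINESTRING."""
    return "LINESTRING(" + ", ".join(f"{x} {y}" for x, y in coords) + ")"

# Made-up coordinates carrying far more digits than any survey would justify.
coords = [
    (4.895167891234567, 52.370215987654321),
    (4.900012345678912, 52.372345678912345),
]
full = to_wkt_linestring(coords)
trimmed = to_wkt_linestring(round_coords(coords))
print(len(full), len(trimmed))
print(trimmed)
```

The saving scales with the number of coordinates, and it also applies before any compression is used.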
--
--------------------------------------
*Geodan*
President Kennedylaan 1
1079 MB Amsterdam (NL)

T +31 (0)20 - 5711 347
E [email protected] <mailto:[email protected]>
www.geodan.nl <http://www.geodan.nl> | disclaimer<http://www.geodan.nl/disclaimer>
--------------------------------------



