On 10-9-2013 13:33, Andy Turner wrote:
Hi,
At least these two OGC standards might be worth having a look at in
this context:
http://www.opengeospatial.org/standards/geosparql
http://www.opengeospatial.org/standards/tjs
The latter is a Georeferenced Table Joining Service Implementation
Standard. In the development of this a lot of thought went into
different kinds of linking of geographical data. Sorry, but I know
very little about the GeoSPARQL standard.
The notion of keeping geometry data separate and providing metadata
about geometries in standard forms is useful. For vector data, the
number of points in the geometry is one of the key attributes an
application might consider before pulling that geometry. (Also the
size of its representation in bytes -- both compressed and
uncompressed is useful info too -- thanks Leigh.)
So, for vector data, the attributes for individual vectors (almost
like features) can be kept separate from the spatial geometries, and
some linkage code can be used to join the data together. Yes, there
are advantages in terms of storage organisation for keeping attributes
and geometries separate, but many applications also want some
attributes of the geometries themselves, so this geometrical metadata
is important to think about. Some of it can be computationally hard to
calculate, so once calculated it is perhaps worth storing as optional
metadata.
Individual points with a single attribute, where the point is defined
with respect to axes in some geographical coordinate and projection
system, are simple geo-vectors. Lines built from multiple such points
(and equations) are more detailed/complex, yet they can still have a
simply attributed, generalised point representation (the location of
the smallest circle/sphere encompassing all the points in the line,
perhaps with a measure of its radius). There are similar things for
regional polygons in two and three dimensions.
With lines and points, their geometries can be simplified in other
ways which can result in other lines and polygons. Simplifying
contiguous polygons to maintain topological relationships is not
necessarily straightforward.
The point I am trying to make with the above is that there are
multiple different geometries, not a single geometry, for a real world
object that can be described/defined with RDF. Some of the more
generalised forms of the spatial geometries can be calculated and
stored as metadata in tables with a fixed number of fields. Often
so-called bounding boxes and bounding circles are used, as are line
lengths, perimeters, surface areas, volumes, average distances, and
ratios of these geometrical attributes. Based on the geometrical
attributes, further attributes can be derived from other attributes
(e.g. density).
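As a minimal sketch of the kind of geometrical metadata described
above, the following computes a point count, a bounding box and the
raw and compressed byte sizes for one line geometry. The function name,
the dictionary keys and the WKT-style serialisation are assumptions
made for illustration only; nothing here is prescribed by GeoSPARQL or
TJS.

```python
import zlib


def summarize(points):
    """Return summary metadata for a 2D geometry given as (x, y) tuples."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    # Assumption: the geometry is published as a WKT-style text literal.
    literal = "LINESTRING (" + ", ".join(f"{x} {y}" for x, y in points) + ")"
    raw = literal.encode("utf-8")
    return {
        "point_count": len(points),
        # Bounding box as (min_x, min_y, max_x, max_y).
        "bbox": (min(xs), min(ys), max(xs), max(ys)),
        "bytes_raw": len(raw),
        # Size after DEFLATE compression with zlib's default settings.
        "bytes_deflated": len(zlib.compress(raw)),
    }


# Hypothetical three-point line (longitude, latitude).
line = [(4.89, 52.37), (4.90, 52.38), (4.92, 52.36)]
summary = summarize(line)
print(summary)
```

Values like these are cheap to store next to the geometry and let an
application judge a geometry before pulling it.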
Consider something complex, like a city. This has multiple geometrical
representations.
Two more things:
Geohashes (http://en.wikipedia.org/wiki/Geohash), which interleave
coordinates represented by positions on axes using some predetermined
axis order and prescription, are useful in the context of linking
data. They are string representations: the more they are truncated,
the less precisely they locate a point, but a truncated geohash still
starts with the same string sequence as the full one.
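The prefix behaviour can be sketched with a small encoder. This
follows the commonly described geohash scheme (alternating longitude
and latitude bits, five bits per base-32 character); the function name
and the sample coordinates are chosen for illustration, and the code is
not a substitute for a tested library.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat, lon, length=9):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True  # the first bit refines the longitude interval
    while len(chars) < length:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # five bits make one base-32 character
            chars.append(BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


coarse = geohash(57.64911, 10.40744, length=5)
fine = geohash(57.64911, 10.40744, length=9)
print(coarse, fine, fine.startswith(coarse))
```

Because each extra character only narrows the same cell further, the
shorter hash is always a prefix of the longer one, which is what makes
plain string matching usable for coarse spatial linking.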
The other key dimension to think about in geographical relations is
time. How time relates to all this is important, but this email is
already long, so all I will state is that a city now could be very
different to a city some years ago (in terms of spatial
dimension/geometry), yet in some ways they are the same place. There
are ways to derive (very) complex geometries of ephemeral events;
consider one like the Olympic games.
HTH and sorry for the long post.
Hello Andy,
Thank you for the long post and for sharing your thoughts.
Yes, I agree that any real world object can have many different
geometries, depending on coordinate reference system, level of detail,
time, method of measurement and whatnot. But I don't think that is a
problem. Linked Data is very capable of sharing different perspectives
of a single real world phenomenon, and also of annotating those
different perspectives to help with correctly interpreting them.
The problem that I see is how to handle those cases where geometry
literals become unwieldy. The GeoSPARQL specification that you mention
provides a way of writing a geometry as a literal in RDF. There may be
several approaches as to how to serialize a geometry, but ending up with
a series of coordinates is inescapable. And I am worried about the impact
of these series of coordinates becoming very long. That is why I also do
like the idea of providing some extra data to enable a client to
distinguish between large and small geometries. The small ones could be
downloaded and processed right away, but the bigger ones might need some
extra care.
Thinking about this, I wonder if the idea of a general compression
function for literals has ever been considered for SPARQL. That would
enable a query like
SELECT ?name, ?population, COMPRESS(?geometry) FROM
<http://example.org/cities>
Such a function could be used only for those literals whose size exceeds
a certain threshold. And it would be applicable to all kinds of big
literals.
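To make the threshold idea concrete, here is a minimal client-side
sketch in Python. It is not a SPARQL feature: the function names, the
"plain" and "deflate+base64" labels and the 1024-byte cut-off are all
assumptions for illustration. Only literals above the threshold are
compressed, and the label tells the receiver how to decode.

```python
import base64
import zlib

THRESHOLD_BYTES = 1024  # hypothetical cut-off, to be tuned per data set


def encode_literal(text, threshold=THRESHOLD_BYTES):
    """Return (encoding, payload); compress only literals above the threshold."""
    raw = text.encode("utf-8")
    if len(raw) <= threshold:
        return ("plain", text)
    # DEFLATE the bytes, then base64 so the result is still a text literal.
    packed = base64.b64encode(zlib.compress(raw)).decode("ascii")
    return ("deflate+base64", packed)


def decode_literal(encoding, payload):
    """Reverse encode_literal for either encoding label."""
    if encoding == "plain":
        return payload
    return zlib.decompress(base64.b64decode(payload)).decode("utf-8")


small = "POINT (4.89 52.37)"
# Hypothetical large geometry: 2000 coordinate pairs in one literal.
big = "LINESTRING (" + ", ".join(f"{i * 0.001:.3f} {i * 0.002:.3f}" for i in range(2000)) + ")"

for literal in (small, big):
    encoding, payload = encode_literal(literal)
    print(encoding, len(literal), len(payload))
```

Note that base64 adds back roughly a third of the compressed size, so
the gain depends on how repetitive the coordinate text is.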
About Geohash: Yes, it is a kind of compression for geometry, but as far
as I can tell it only applies to single points.
Regards,
Frans
Andy
http://www.geog.leeds.ac.uk/people/a.turner/
*From:* Frans Knibbe | Geodan [mailto:[email protected]]
*Sent:* 10 September 2013 11:11
*To:* Leigh Dodds
*Cc:* public-lod community
*Subject:* Re: Minimizing data volume
On 9-9-2013 16:48, Leigh Dodds wrote:
Hi,
Before using compression you might also make a decision about whether
you need to represent all of this information as RDF in the first
place.
For example, rather than include the large geometries as literals, why
not store them as separate documents and let clients fetch the
geometries when needed, rather than as part of a SPARQL query?
Geometries can be served using standard HTTP compression techniques
and will benefit from caching.
You can provide summary statistics (including size of the document,
and properties of the described area, e.g. centroids) in the RDF to
help address a few common requirements, allowing clients to only fetch
the geometries they need, as they need them.
This can greatly reduce the volume of data you have to store and
provides clients with more flexibility.
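A minimal sketch of how a client might use such summaries, assuming
each one carries a bounding box, a document size in bytes and the URL
of the separate geometry document. The field names and the example.org
URLs are hypothetical. The client fetches only geometries that touch
its map viewport and fit its size budget.

```python
def bbox_intersects(a, b):
    """True if two (min_x, min_y, max_x, max_y) boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def geometries_to_fetch(summaries, viewport, max_bytes):
    """Return the URLs of geometries worth fetching for this viewport."""
    wanted = []
    for s in summaries:
        if bbox_intersects(s["bbox"], viewport) and s["bytes"] <= max_bytes:
            wanted.append(s["url"])
    return wanted


# Hypothetical summaries, as they might come back from a SPARQL query.
summaries = [
    {"url": "http://example.org/geom/a", "bbox": (4.7, 52.3, 5.0, 52.4), "bytes": 2000},
    {"url": "http://example.org/geom/b", "bbox": (5.5, 51.9, 5.7, 52.1), "bytes": 500},
    {"url": "http://example.org/geom/c", "bbox": (4.8, 52.3, 4.9, 52.4), "bytes": 5_000_000},
]
viewport = (4.0, 52.0, 5.0, 53.0)
urls = geometries_to_fetch(summaries, viewport, max_bytes=100_000)
print(urls)
```

The selected documents can then be requested over HTTP with standard
content compression and caching.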
Cheers,
L.
Yes, that is something to consider. Thanks for broadening my mind! I
think such an approach may be suited for certain kinds of high volume
data, like images or video. But I do have some doubts about its
effectiveness for geographical data:
1) In geographical data sets geometries typically have different
sizes. Some may be very big, others may be reasonably small. So where
to draw the limit?
2) When using SPARQL and RDF it is already possible to provide summary
statistics and leave it to the client to fetch the geometries if
needed. However, it is not standard practice to provide summaries like
centroid, bounding box or coordinate count for each geometry, but
perhaps it should be.
3) On the surface, this approach seems to add complexity to data
retrieval, for both clients and servers. Instead of one way of
publishing and getting data, there will be two ways.
4) Having to fetch geometries one at a time, instead of processing
them all from one data set, could complicate matters and also
introduce some loss of performance. I can imagine this method working
well for things like images, videos or files, because they are
typically used one at a time. But in many cases geometries should be
available all at once, to draw on a map for instance.
5) I think most geometries are stored as attribute data in relational
databases. Preprocessing them to make them available as separate files
can be done offline. But in other cases the geometries are transient:
they could be generated by a function in a query. The method should
deliver performance gains in those cases too.
Regards,
Frans
On Mon, Sep 9, 2013 at 10:47 AM, Frans Knibbe | Geodan
<[email protected]> wrote:
Hello,
In my line of work (geographical information) I often deal with high
volume data. The high volume is caused by single facts having a big
size. A single 2D or 3D geometry is often encoded as a single text
string and can consist of thousands of numbers (coordinates). It is
easy to see that this can cause performance issues with transferring
and processing data. So I wonder about the state of the art in
minimizing data volume in Linked Data. I know that careful publication
of data will help a bit: multiple levels of detail could be published,
coordinates could use significant digits (they almost never do), but
it seems to me that some kind of compression is needed too. Is there
something like a common approach to data compression at the moment?
Something that is understood by both publishers and consumers of data?
Regards,
Frans
--
--------------------------------------
*Geodan*
President Kennedylaan 1
1079 MB Amsterdam (NL)
T +31 (0)20 - 5711 347
E [email protected]
www.geodan.nl <http://www.geodan.nl> | disclaimer
<http://www.geodan.nl/disclaimer>
--------------------------------------