Is anyone looking for a neat project in Calcite that would have a big
impact? I'm thinking that we could add support for spatial indexes to
Calcite in such a way that downstream projects such as Phoenix and
Flink could easily benefit from it.
GIS (Geographic Information Systems, aka Spatial database) is really
useful functionality to have in your database. To find restaurants
less than 1 km from downtown San Francisco, you could run
select *
from restaurants as r
where st_distance(point(-122.4194, 37.7749), r.coordinates) <= 1;
There are mature SQL implementations of GIS in PostGIS, Oracle Spatial
and Microsoft SQL Server; and OpenGIS has standardized SQL
extensions[1].
Now, the SQL-GIS standard is rather large, and involves implementing
lots of data types and scalar functions. We could get to that
eventually. But I contend that many, many applications would be
satisfied by points and distances (like the query above) and a spatial
index to make them run quickly. And I believe that we can add spatial
index support to Calcite using a logical rewrite rule.
Rewriting spatial queries to indexes on space-filling curves is a
well-established technique [2].
Suppose that the restaurants table, above, had columns latitude and
longitude and a computed numeric column h = hilbert(latitude,
longitude). Hilbert curves are space-filling curves such that if two
points are close in space then their h values will be close. So, if
there is an index on h, we can find all restaurants close to a given
point using a range scan of the index.
So, the above query could be rewritten to something like
select *
from restaurants as r
where (r.h between 123456 and 123599
or r.h between 256789 and 259887)
and st_distance_internal(point(-122.4194, 37.7749), r.coordinates) <= 1;
The range predicates on r.h quickly eliminate 99.9% of the rows in the
database, and the call to st_distance_internal eliminates the
remaining false positives.
That rewrite can be done using a logical rewrite rule, and the
resulting query will be faster on just about any database, but
especially one with key-sorted tables (like Phoenix/HBase) or
range-partitioned tables. The database does not need to have a
dedicated "spatial index" data structure.
Julian
[1] http://www.opengeospatial.org/standards/sfs
[2] http://math.bme.hu/~gnagy/mmsz/eloadasok/BisztrayDenes2014.pdf