It would probably be possible to get substantial gains on existing formats that have spatial indexing by making them more optimal/balanced however I see the following major issues:
* Not sure how by convention deal with non-updatable indexes (packing/sorting on hilbert or sort-tile-recursive seem inherently static) on formats that are assumed to be mutable * GeoPackage cannot benefit from clustering on spatial index (so it will not be possible to "cloud" optimize) * Shapefile has several other drawbacks as a format perhaps making it less interesting to invest into I have wondered several times about these things in relation to PostgreSQL and PostGIS... first thought is why there is no way to mark a table static/immutable in PostgreSQL, it seems to be there would be several general optimization possibilities if it was known that a table cannot be changed, not the least that it would make it possible to construct similar packed R-trees as discussed above. Second thought is how the R-tree is constructed/implemented via gist index (I have not had the brain power/interest to dive into the source to understand it yet) - unless it can be static it should probably optimally be a R*-tree with topological split but I've got indications that it is not. /Björn Den mån 25 feb. 2019 kl 11:06 skrev Darafei "Komяpa" Praliaskouski < [email protected]>: > Hi Björn, > > Would it make sense to implement it not as a separate format, but as a > special convention on top of any of already existing ones? > What happens if you sort by Hilbert curve a shapefile or geopackage, and > tweak the internals to build indexes in flattish manner? > > For example, here is a branch of Postgres where GiST index is being built > in flat manner. \ > It does not require any change on user's side, but gets you all the > benefits by plainly building 10x faster. > https://github.com/glukhovn/postgres/tree/gist_btree_build > > > On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell <[email protected]> > wrote: > >> I've now implemented so that the traversal of the R-tree reads nodes on >> demand and as suspected it gives a significant boost: >> >> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done >> real 0m0,378s (~ 20% improvement) >> >> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single; >> done >> real 0m0,790s (~ 30% improvement) >> >> I'm now also faster on the reduced window (56 10.1 56.1) beating gpkg: >> >> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1 >> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single; >> done >> real 0m0,305s >> >> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1 >> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single; >> done >> real 0m0,242s >> >> It still remains a TODO to further optimize I/O to read larger blocks if >> possible (applies both to R-tree nodes and the feature data itself) and I >> think that could prove a very significant performance boost potentially >> leaving any competition in the dust. :D >> >> /Björn >> >> Den lör 9 feb. 2019 kl 00:08 skrev Björn Harrtell < >> [email protected]>: >> >>> I've now implemented spatial filtering (at GDAL fork >>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are >>> promising, for two test cases about 3,5-4 times faster than shapefile with >>> the same content. I've used these command lines to measure total time over >>> 10 iterations for each format and dataset: >>> >>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>> gis_osm_roads_free_1.shp gis_osm_roads_free_1; done >>> real 0m1,940s >>> >>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>> gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done >>> real 0m0,750s >>> >>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done >>> real 0m0,461s >>> >>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>> gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single; >>> done >>> real 0m4,381s >>> >>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single; >>> done >>> real 0m1,624s >>> >>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single; >>> done >>> real 0m1,113s >>> >>> The bounds 10 56 10.2 56.2 contains 21525 out of the 812547 total >>> features for gis_osm_roads_free_1 and 57699 out of 2027890 features for >>> gis_osm_buildings_a_free_1_single. >>> >>> And there is definitely room for optimization. The implementation is >>> naive and reads the whole R-tree in memory and does not yet read >>> consecutive features (common case as the R-tree is balanced) in larger I/O >>> blocks which I think can be significant. If I reduce the extent to 10 56 >>> 10.1 56.1 GPKG is actually fair bit faster which I think is due to it being >>> able to read the index partially. >>> >>> /Björn >>> >>> Den tis 22 jan. 2019 kl 18:43 skrev Björn Harrtell < >>> [email protected]>: >>> >>>> After posting about my experimental format I realized that I lack >>>> numbers on the potential performance, so I tried to make some more or less >>>> scientific measuring. The results was disappointing, reaching similar >>>> performance as shapefile for full sequential reads and so I lost interest >>>> for a while. But recently I found out that my method of measuring was >>>> flawed - I measured the time to read into a new shapefile using ogr2ogr, >>>> but it turns out that can be quite unfair due to the string field length >>>> dynamic resize that ogr2ogr will do if the strings from input feature >>>> attributes are of variable length. I've now made some new measurements >>>> using ogrinfo and the undocumented flag -qq to get a full sequential read. >>>> >>>> I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to >>>> single poly) and "gis_osm_roads_free_1" from Denmark at >>>> http://download.geofabrik.de/europe.html as the source, converted to >>>> respective format, then measured the time it takes to run ogrinfo -qq on >>>> them. Results (average over three runs): >>>> >>>> ## gis_osm_buildings_a_free_1 (2027890 Polygons) >>>> * SHP: 3800 ms >>>> * GPKG: 2700 ms >>>> * FlatGeobuf: 2200 ms >>>> >>>> ## gis_osm_roads_free_1 (812547 LineStrings) >>>> * SHP: 1600 ms >>>> * GPKG: 1350 ms >>>> * FlatGeobuf: 800 ms >>>> >>>> With that my hope and interest is rekindled. I believe I can fare even >>>> better against the competition with with spatially filtered searches, will >>>> hopefully soon have some results on that too. >>>> >>>> /Björn >>>> >>>> Den sön 9 dec. 2018 kl 20:36 skrev Björn Harrtell < >>>> [email protected]>: >>>> >>>>> Hi GDAL/OGR folks, >>>>> >>>>> In my spare time I've been working on a vector file format called >>>>> FlatGeobuf (tentatively). The main reason, besides curiosity and learning, >>>>> I'm putting time into it is that I think shapefile still rules the >>>>> read/query static data performance game, which is kind of sad, and >>>>> probably >>>>> one of the reasons it is still so widely popular. Also, the main >>>>> competitor >>>>> (GeoPackage) isn't suitable for streaming access (AFAIK) which shapefiles >>>>> also handles surprisingly (?) well. >>>>> >>>>> By using a performance focused write once binary encoding >>>>> (flatbuffers), schema constraint and focusing on read/query performance by >>>>> clustering on an optimal spatial index (Packed Hilbert R-Tree) I hope to >>>>> be >>>>> able to beat shapefile performance and at the same time be optimal for >>>>> streaming/cloud access. >>>>> >>>>> I think I'm starting to get somewhere, more details and source is at >>>>> https://github.com/bjornharrtell/flatgeobuf and I have an early proof >>>>> of concept driver implementation at >>>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are >>>>> already quite promising - it can do roundtrip read/write and is already >>>>> quite a bit faster than shapefile. I also have implemented naive read only >>>>> QGIS provider for experimental purposes. >>>>> >>>>> Basically I'm fishing for interest in this effort, hoping that others >>>>> will see potential in it even if it's "yet another format" and far from >>>>> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good >>>>> place for a format specification and reference implementation to incubate. >>>>> >>>>> Some additional food for thought that I've had during the >>>>> experimentation: >>>>> >>>>> 1. The main in memory representations of geometry/coordinates seem to >>>>> be either separate arrays per dimension (GDAL (partially?) and QGIS) or a >>>>> single array with dimension as stride. I've chosen the latter as of yet >>>>> but >>>>> I'm still a bit undecided. There is of course a substantial involved in >>>>> transforming between the two representations so the situation with two >>>>> competing models is a bit unfortunate. I'm also not sure about which of >>>>> these models are objectively "better" than the other? >>>>> >>>>> 2. One could get better space efficiency with protobuf instead of >>>>> flatbuffers, but it has a performance cost. I have not gone far into >>>>> investigating how much though and one could even reason about supporting >>>>> both these encodings in a single format specification but it's taking it a >>>>> bit far I think. >>>>> >>>>> 3. Is there a reason to support different packing strategies for the >>>>> R-Tree or is Packed Hilbert a good compromise (besides it being >>>>> beautifully >>>>> simple to implement)? >>>>> >>>>> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and >>>>> has a bad/boring abbreviation. Other candidates I've considered are >>>>> OpenGeoFile, OpenVectorFile or OpenSpatialFile but I'm undecided. Any >>>>> ideas? :) >>>>> >>>>> Regards, >>>>> >>>>> Björn Harrtell >>>>> >>>> _______________________________________________ >> gdal-dev mailing list >> [email protected] >> https://lists.osgeo.org/mailman/listinfo/gdal-dev > > > > -- > Darafei Praliaskouski > Support me: http://patreon.com/komzpa >
_______________________________________________ gdal-dev mailing list [email protected] https://lists.osgeo.org/mailman/listinfo/gdal-dev
