Hi Björn, Would it make sense to implement it not as a separate format, but as a special convention on top of any of already existing ones? What happens if you sort by Hilbert curve a shapefile or geopackage, and tweak the internals to build indexes in flattish manner?
For example, here is a branch of Postgres where GiST index is being built in flat manner. \ It does not require any change on user's side, but gets you all the benefits by plainly building 10x faster. https://github.com/glukhovn/postgres/tree/gist_btree_build On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell <[email protected]> wrote: > I've now implemented so that the traversal of the R-tree reads nodes on > demand and as suspected it gives a significant boost: > > > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 > gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done > real 0m0,378s (~ 20% improvement) > > > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 > gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single; > done > real 0m0,790s (~ 30% improvement) > > I'm now also faster on the reduced window (56 10.1 56.1) beating gpkg: > > > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1 > gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single; > done > real 0m0,305s > > > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1 > gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single; > done > real 0m0,242s > > It still remains a TODO to further optimize I/O to read larger blocks if > possible (applies both to R-tree nodes and the feature data itself) and I > think that could prove a very significant performance boost potentially > leaving any competition in the dust. :D > > /Björn > > Den lör 9 feb. 2019 kl 00:08 skrev Björn Harrtell < > [email protected]>: > >> I've now implemented spatial filtering (at GDAL fork >> https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are >> promising, for two test cases about 3,5-4 times faster than shapefile with >> the same content. I've used these command lines to measure total time over >> 10 iterations for each format and dataset: >> >> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >> gis_osm_roads_free_1.shp gis_osm_roads_free_1; done >> real 0m1,940s >> >> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >> gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done >> real 0m0,750s >> >> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done >> real 0m0,461s >> >> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >> gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single; >> done >> real 0m4,381s >> >> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single; >> done >> real 0m1,624s >> >> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2 >> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single; >> done >> real 0m1,113s >> >> The bounds 10 56 10.2 56.2 contains 21525 out of the 812547 total >> features for gis_osm_roads_free_1 and 57699 out of 2027890 features for >> gis_osm_buildings_a_free_1_single. >> >> And there is definitely room for optimization. The implementation is >> naive and reads the whole R-tree in memory and does not yet read >> consecutive features (common case as the R-tree is balanced) in larger I/O >> blocks which I think can be significant. If I reduce the extent to 10 56 >> 10.1 56.1 GPKG is actually fair bit faster which I think is due to it being >> able to read the index partially. >> >> /Björn >> >> Den tis 22 jan. 2019 kl 18:43 skrev Björn Harrtell < >> [email protected]>: >> >>> After posting about my experimental format I realized that I lack >>> numbers on the potential performance, so I tried to make some more or less >>> scientific measuring. The results was disappointing, reaching similar >>> performance as shapefile for full sequential reads and so I lost interest >>> for a while. But recently I found out that my method of measuring was >>> flawed - I measured the time to read into a new shapefile using ogr2ogr, >>> but it turns out that can be quite unfair due to the string field length >>> dynamic resize that ogr2ogr will do if the strings from input feature >>> attributes are of variable length. I've now made some new measurements >>> using ogrinfo and the undocumented flag -qq to get a full sequential read. >>> >>> I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to >>> single poly) and "gis_osm_roads_free_1" from Denmark at >>> http://download.geofabrik.de/europe.html as the source, converted to >>> respective format, then measured the time it takes to run ogrinfo -qq on >>> them. Results (average over three runs): >>> >>> ## gis_osm_buildings_a_free_1 (2027890 Polygons) >>> * SHP: 3800 ms >>> * GPKG: 2700 ms >>> * FlatGeobuf: 2200 ms >>> >>> ## gis_osm_roads_free_1 (812547 LineStrings) >>> * SHP: 1600 ms >>> * GPKG: 1350 ms >>> * FlatGeobuf: 800 ms >>> >>> With that my hope and interest is rekindled. I believe I can fare even >>> better against the competition with with spatially filtered searches, will >>> hopefully soon have some results on that too. >>> >>> /Björn >>> >>> Den sön 9 dec. 2018 kl 20:36 skrev Björn Harrtell < >>> [email protected]>: >>> >>>> Hi GDAL/OGR folks, >>>> >>>> In my spare time I've been working on a vector file format called >>>> FlatGeobuf (tentatively). The main reason, besides curiosity and learning, >>>> I'm putting time into it is that I think shapefile still rules the >>>> read/query static data performance game, which is kind of sad, and probably >>>> one of the reasons it is still so widely popular. Also, the main competitor >>>> (GeoPackage) isn't suitable for streaming access (AFAIK) which shapefiles >>>> also handles surprisingly (?) well. >>>> >>>> By using a performance focused write once binary encoding >>>> (flatbuffers), schema constraint and focusing on read/query performance by >>>> clustering on an optimal spatial index (Packed Hilbert R-Tree) I hope to be >>>> able to beat shapefile performance and at the same time be optimal for >>>> streaming/cloud access. >>>> >>>> I think I'm starting to get somewhere, more details and source is at >>>> https://github.com/bjornharrtell/flatgeobuf and I have an early proof >>>> of concept driver implementation at >>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are >>>> already quite promising - it can do roundtrip read/write and is already >>>> quite a bit faster than shapefile. I also have implemented naive read only >>>> QGIS provider for experimental purposes. >>>> >>>> Basically I'm fishing for interest in this effort, hoping that others >>>> will see potential in it even if it's "yet another format" and far from >>>> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good >>>> place for a format specification and reference implementation to incubate. >>>> >>>> Some additional food for thought that I've had during the >>>> experimentation: >>>> >>>> 1. The main in memory representations of geometry/coordinates seem to >>>> be either separate arrays per dimension (GDAL (partially?) and QGIS) or a >>>> single array with dimension as stride. I've chosen the latter as of yet but >>>> I'm still a bit undecided. There is of course a substantial involved in >>>> transforming between the two representations so the situation with two >>>> competing models is a bit unfortunate. I'm also not sure about which of >>>> these models are objectively "better" than the other? >>>> >>>> 2. One could get better space efficiency with protobuf instead of >>>> flatbuffers, but it has a performance cost. I have not gone far into >>>> investigating how much though and one could even reason about supporting >>>> both these encodings in a single format specification but it's taking it a >>>> bit far I think. >>>> >>>> 3. Is there a reason to support different packing strategies for the >>>> R-Tree or is Packed Hilbert a good compromise (besides it being beautifully >>>> simple to implement)? >>>> >>>> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and >>>> has a bad/boring abbreviation. Other candidates I've considered are >>>> OpenGeoFile, OpenVectorFile or OpenSpatialFile but I'm undecided. Any >>>> ideas? :) >>>> >>>> Regards, >>>> >>>> Björn Harrtell >>>> >>> _______________________________________________ > gdal-dev mailing list > [email protected] > https://lists.osgeo.org/mailman/listinfo/gdal-dev -- Darafei Praliaskouski Support me: http://patreon.com/komzpa
_______________________________________________ gdal-dev mailing list [email protected] https://lists.osgeo.org/mailman/listinfo/gdal-dev
