Re: [gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2019-03-11 Thread Björn Harrtell
On Mon, 11 Mar 2019 at 15:07, Darafei "Komяpa" Praliaskouski <
m...@komzpa.net> wrote:

>
>
> On Mon, Mar 11, 2019 at 1:08 AM Björn Harrtell 
> wrote:
>
>> It would probably be possible to get substantial gains on existing
>> formats that have spatial indexing by making them more optimal/balanced;
>> however, I see the following major issues:
>>
>> * Not sure how, by convention, to deal with non-updatable indexes
>> (packing/sorting on Hilbert or sort-tile-recursive seems inherently static)
>> on formats that are assumed to be mutable
>>
>
> Why does Hilbert indexing set a requirement for it to be non-updatable?
>

> If it's packed into the same tree format, it's going to be updatable just the
> same as the previous version. Yeah, it's going to degrade a bit on changes,
> but if the tool to put sorting back in place is available then it's not a
> big issue?
>

I'm not entirely sure, but it seems to me that R-tree algorithms are divided
into static and dynamic ones for good reasons. The static ones have a
predictable structure with near 100% space utilization, and I believe the
implementation can exploit those assumptions for performance as long as
the structure is kept intact. Dynamic ones are typically more complex
because they need a strategy for tree insertion/balancing.

See https://github.com/mourner/rbush and https://github.com/mourner/flatbush
for contemporary, minimal implementations of the dynamic and static cases,
with references to the algorithms used.
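As a minimal sketch of the static case (assuming the published flatbush API
and some made-up bounding boxes) - the index is built once from a known item
count and cannot be updated afterwards:

import Flatbush from 'flatbush';

// Hypothetical feature bounding boxes; in a file format these would
// come from the features about to be written.
const boxes = [
  { minX: 10.0, minY: 56.0, maxX: 10.1, maxY: 56.1 },
  { minX: 10.2, minY: 56.2, maxX: 10.3, maxY: 56.3 },
];

// A packed index is built once, from a known item count...
const index = new Flatbush(boxes.length);
for (const b of boxes) index.add(b.minX, b.minY, b.maxX, b.maxY);
index.finish(); // ...sorted and packed here; no inserts after this

// search() returns the indices of items intersecting the query bbox.
const hits = index.search(10.0, 56.0, 10.15, 56.15);
console.log(hits);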

>
>
>> * GeoPackage cannot benefit from clustering on spatial index (so it will
>> not be possible to "cloud" optimize)
>>
>
> Why? What should it get for it to be possible?
>

I'm taking Even's word for it (see earlier in this thread), but I shouldn't
have written cannot; he wrote that it's likely possible, though as I
interpreted it, quite far from trivial. When Even writes that something is
not trivial, I instinctively think it's near impossible. :)


>> * Shapefile has several other drawbacks as a format, perhaps making it less
>> interesting to invest in
>>
>
> Its killer feature is universal support.
>

Agreed, but I'm saying I don't want that to be the case forever. :)


> I have wondered several times about these things in relation to PostgreSQL
>> and PostGIS... first thought is why there is no way to mark a table
>> static/immutable in PostgreSQL; it seems to me there would be several
>> general optimization possibilities if it was known that a table cannot be
>> changed, not least that it would make it possible to construct similar
>> packed R-trees as discussed above.
>>
>
> You probably missed a link I posted,
> https://github.com/glukhovn/postgres/tree/gist_btree_build - here a
> GiST is built in a similar way, without requiring the table to be read-only.
>
>
>
>> Second thought is how the R-tree is constructed/implemented via the GiST
>> index (I have not had the brain power/interest to dive into the source to
>> understand it yet) - unless it can be static it should probably optimally
>> be an R*-tree with topological split, but I've got indications that it is not.
>>
>
> GiST is just *a* tree on top of support functions. More sophisticated
> versions of the tree are limited by what callbacks Postgres provides.
>
>
> /Björn
>
> On Mon, 25 Feb 2019 at 11:06, Darafei "Komяpa" Praliaskouski <
> m...@komzpa.net> wrote:
>
>> Hi Björn,
>>
>> Would it make sense to implement it not as a separate format, but as a
>> special convention on top of any of the already existing ones?
>> What happens if you sort a shapefile or geopackage by Hilbert curve, and
>> tweak the internals to build indexes in a flattish manner?
>>
>> For example, here is a branch of Postgres where the GiST index is built
>> in a flat manner.
>> It does not require any change on the user's side, but gets you all the
>> benefits, simply by building 10x faster.
>> https://github.com/glukhovn/postgres/tree/gist_btree_build
>>
>>
>> On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell 
>> wrote:
>>
>>> I've now implemented it so that the traversal of the R-tree reads nodes on
>>> demand and, as suspected, it gives a significant boost:
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>>> real 0m0,378s (~ 20% improvement)
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m0,790s (~ 30% improvement)
>>>
>>> I'm now also faster on the reduced window (56 10.1 56.1) beating gpkg:
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m0,305s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m0,242s
>>>
It still remains a TODO to further optimize I/O to read larger blocks if
possible (applies both to R-tree nodes and the feature data itself) and I
think that could prove a very significant performance boost potentially
leaving any competition in the dust. :D

Re: [gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2019-03-11 Thread Komяpa
On Mon, Mar 11, 2019 at 1:08 AM Björn Harrtell 
wrote:

> It would probably be possible to get substantial gains on existing formats
> that have spatial indexing by making them more optimal/balanced; however,
> I see the following major issues:
>
> * Not sure how, by convention, to deal with non-updatable indexes
> (packing/sorting on Hilbert or sort-tile-recursive seems inherently static)
> on formats that are assumed to be mutable
>

Why does Hilbert indexing set a requirement for it to be non-updatable?

If it's packed into the same tree format, it's going to be updatable just the
same as the previous version. Yeah, it's going to degrade a bit on changes,
but if the tool to put sorting back in place is available then it's not a
big issue?



> * GeoPackage cannot benefit from clustering on spatial index (so it will
> not be possible to "cloud" optimize)
>

Why? What should it get for it to be possible?


> * Shapefile has several other drawbacks as a format, perhaps making it less
> interesting to invest in
>

Its killer feature is universal support.


> I have wondered several times about these things in relation to PostgreSQL
> and PostGIS... first thought is why there is no way to mark a table
> static/immutable in PostgreSQL; it seems to me there would be several
> general optimization possibilities if it was known that a table cannot be
> changed, not least that it would make it possible to construct similar
> packed R-trees as discussed above.
>

You probably missed a link I posted,
https://github.com/glukhovn/postgres/tree/gist_btree_build - here a GiST
is built in a similar way, without requiring the table to be read-only.



> Second thought is how the R-tree is constructed/implemented via the GiST
> index (I have not had the brain power/interest to dive into the source to
> understand it yet) - unless it can be static it should probably optimally
> be an R*-tree with topological split, but I've got indications that it is not.
>

GiST is just *a* tree on top of support functions. More sophisticated
versions of the tree are limited by what callbacks Postgres provides.


/Björn

On Mon, 25 Feb 2019 at 11:06, Darafei "Komяpa" Praliaskouski <
m...@komzpa.net> wrote:

> Hi Björn,
>
> Would it make sense to implement it not as a separate format, but as a
> special convention on top of any of the already existing ones?
> What happens if you sort a shapefile or geopackage by Hilbert curve, and
> tweak the internals to build indexes in a flattish manner?
>
> For example, here is a branch of Postgres where the GiST index is built
> in a flat manner.
> It does not require any change on the user's side, but gets you all the
> benefits, simply by building 10x faster.
> https://github.com/glukhovn/postgres/tree/gist_btree_build
>
>
> On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell 
> wrote:
>
>> I've now implemented it so that the traversal of the R-tree reads nodes on
>> demand and, as suspected, it gives a significant boost:
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>> real 0m0,378s (~ 20% improvement)
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>> done
>> real 0m0,790s (~ 30% improvement)
>>
>> I'm now also faster on the reduced window (56 10.1 56.1) beating gpkg:
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
>> done
>> real 0m0,305s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>> done
>> real 0m0,242s
>>
>> It still remains a TODO to further optimize I/O to read larger blocks if
>> possible (applies both to R-tree nodes and the feature data itself) and I
>> think that could prove a very significant performance boost potentially
>> leaving any competition in the dust. :D
>>
>> /Björn
>>
>> On Sat, 9 Feb 2019 at 00:08, Björn Harrtell <
>> bjorn.harrt...@gmail.com> wrote:
>>
>>> I've now implemented spatial filtering (at GDAL fork
>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are
>>> promising - for two test cases about 3.5-4 times faster than shapefile with
>>> the same content. I've used these command lines to measure total time over
>>> 10 iterations for each format and dataset:
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
>>> real 0m1,940s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
>>> real 0m0,750s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>>> real 0m0,461s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> 

Re: [gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2019-03-11 Thread Andreas Oxenstierna
We see a significant performance boost when clustering a large PostGIS
table on the spatial index. This rewrites the table in spatial index
order.

CLUSTER <table> USING <index>
My understanding is that you need to periodically recluster the table
when it is updated, which is an atomic, exclusive rewrite of the complete
table, i.e. it will take time.

You may ask the PostgreSQL team about this, e.g. Magnus Hagander.
You may ask Sandro Santilli (PostGIS topology) about topological aspects.
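For reference, a rough sketch of that maintenance flow driven from
node-postgres; the table and index names (roads, roads_geom_idx) are
hypothetical and assume a GiST index created beforehand, e.g.
CREATE INDEX roads_geom_idx ON roads USING gist (geom):

import { Client } from 'pg';

async function recluster(): Promise<void> {
  const client = new Client(); // connection settings from PG* env vars
  await client.connect();
  // One-time: tie the table to its spatial (GiST) index.
  await client.query('CLUSTER roads USING roads_geom_idx');
  // After enough updates, a bare CLUSTER rewrites the table again in
  // index order; it takes an exclusive lock for the duration.
  await client.query('CLUSTER roads');
  await client.end();
}

recluster().catch(console.error);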


It would probably be possible to get substantial gains on existing 
formats that have spatial indexing by making them more 
optimal/balanced; however, I see the following major issues:


* Not sure how, by convention, to deal with non-updatable indexes
(packing/sorting on Hilbert or sort-tile-recursive seems inherently
static) on formats that are assumed to be mutable
* GeoPackage cannot benefit from clustering on spatial index (so it 
will not be possible to "cloud" optimize)
* Shapefile has several other drawbacks as a format, perhaps making it
less interesting to invest in


I have wondered several times about these things in relation to
PostgreSQL and PostGIS... first thought is why there is no way to mark
a table static/immutable in PostgreSQL; it seems to me there would be
several general optimization possibilities if it was known that a
table cannot be changed, not least that it would make it possible
to construct similar packed R-trees as discussed above. Second thought
is how the R-tree is constructed/implemented via the GiST index (I have
not had the brain power/interest to dive into the source to understand
it yet) - unless it can be static it should probably optimally be
an R*-tree with topological split, but I've got indications that it is not.


/Björn

On Mon, 25 Feb 2019 at 11:06, Darafei "Komяpa" Praliaskouski
<m...@komzpa.net> wrote:


Hi Björn,

Would it make sense to implement it not as a separate format, but
as a special convention on top of any of the already existing ones?
What happens if you sort a shapefile or geopackage by Hilbert curve,
and tweak the internals to build indexes in a flattish manner?

For example, here is a branch of Postgres where the GiST index is
built in a flat manner.
It does not require any change on the user's side, but gets you all
the benefits, simply by building 10x faster.
https://github.com/glukhovn/postgres/tree/gist_btree_build


On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell
<bjorn.harrt...@gmail.com> wrote:

I've now implemented it so that the traversal of the R-tree reads
nodes on demand and, as suspected, it gives a significant boost:

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
real 0m0,378s (~ 20% improvement)

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_buildings_a_free_1_single.fgb
gis_osm_buildings_a_free_1_single; done
real 0m0,790s (~ 30% improvement)

I'm now also faster on the reduced window (56 10.1 56.1)
beating gpkg:

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
gis_osm_buildings_a_free_1_single.gpkg
gis_osm_buildings_a_free_1_single; done
real 0m0,305s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
gis_osm_buildings_a_free_1_single.fgb
gis_osm_buildings_a_free_1_single; done
real 0m0,242s

It still remains a TODO to further optimize I/O to read larger
blocks if possible (applies both to R-tree nodes and the
feature data itself) and I think that could prove a very
significant performance boost potentially leaving any
competition in the dust. :D

/Björn

On Sat, 9 Feb 2019 at 00:08, Björn Harrtell
<bjorn.harrt...@gmail.com> wrote:

I've now implemented spatial filtering (at GDAL fork
https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and
results are promising - for two test cases about 3.5-4
times faster than shapefile with the same content. I've
used these command lines to measure total time over 10
iterations for each format and dataset:

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
56.2 gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
real 0m1,940s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
56.2 gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
real 0m0,750s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
56.2 gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
real 0m0,461s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2
56.2 gis_osm_buildings_a_free_1_single.shp
gis_osm_buildings_a_free_1_single; done

Re: [gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2019-03-10 Thread Björn Harrtell
It would probably be possible to get substantial gains on existing formats
that have spatial indexing by making them more optimal/balanced; however,
I see the following major issues:

* Not sure how, by convention, to deal with non-updatable indexes
(packing/sorting on Hilbert or sort-tile-recursive seems inherently static)
on formats that are assumed to be mutable
* GeoPackage cannot benefit from clustering on spatial index (so it will
not be possible to "cloud" optimize)
* Shapefile has several other drawbacks as a format, perhaps making it less
interesting to invest in

I have wondered several times about these things in relation to PostgreSQL
and PostGIS... first thought is why there is no way to mark a table
static/immutable in PostgreSQL; it seems to me there would be several
general optimization possibilities if it was known that a table cannot be
changed, not least that it would make it possible to construct similar
packed R-trees as discussed above. Second thought is how the R-tree is
constructed/implemented via the GiST index (I have not had the brain
power/interest to dive into the source to understand it yet) - unless it
can be static it should probably optimally be an R*-tree with topological
split, but I've got indications that it is not.

/Björn

On Mon, 25 Feb 2019 at 11:06, Darafei "Komяpa" Praliaskouski <
m...@komzpa.net> wrote:

> Hi Björn,
>
> Would it make sense to implement it not as a separate format, but as a
> special convention on top of any of the already existing ones?
> What happens if you sort a shapefile or geopackage by Hilbert curve, and
> tweak the internals to build indexes in a flattish manner?
>
> For example, here is a branch of Postgres where the GiST index is built
> in a flat manner.
> It does not require any change on the user's side, but gets you all the
> benefits, simply by building 10x faster.
> https://github.com/glukhovn/postgres/tree/gist_btree_build
>
>
> On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell 
> wrote:
>
>> I've now implemented it so that the traversal of the R-tree reads nodes on
>> demand and, as suspected, it gives a significant boost:
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>> real 0m0,378s (~ 20% improvement)
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>> done
>> real 0m0,790s (~ 30% improvement)
>>
>> I'm now also faster on the reduced window (56 10.1 56.1) beating gpkg:
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
>> done
>> real 0m0,305s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>> done
>> real 0m0,242s
>>
>> It still remains a TODO to further optimize I/O to read larger blocks if
>> possible (applies both to R-tree nodes and the feature data itself) and I
>> think that could prove a very significant performance boost potentially
>> leaving any competition in the dust. :D
>>
>> /Björn
>>
>> On Sat, 9 Feb 2019 at 00:08, Björn Harrtell <
>> bjorn.harrt...@gmail.com> wrote:
>>
>>> I've now implemented spatial filtering (at GDAL fork
>>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are
>>> promising - for two test cases about 3.5-4 times faster than shapefile with
>>> the same content. I've used these command lines to measure total time over
>>> 10 iterations for each format and dataset:
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
>>> real 0m1,940s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
>>> real 0m0,750s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>>> real 0m0,461s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m4,381s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m1,624s
>>>
>>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>>> done
>>> real 0m1,113s
>>>
>>> The bounds 10 56 10.2 56.2 contain 21525 out of the 812547 total
>>> features for gis_osm_roads_free_1 and 57699 out of 2027890 features for
>>> gis_osm_buildings_a_free_1_single.
>>>
>>> And there is definitely room for optimization. The implementation is
>>> naive and reads the whole R-tree into memory and does not yet read
>>> consecutive features (a common case, as the R-tree is balanced) in larger
>>> I/O blocks, which I think can be significant.

Re: [gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2019-02-25 Thread Komяpa
Hi Björn,

Would it make sense to implement it not as a separate format, but as a
special convention on top of any of the already existing ones?
What happens if you sort a shapefile or geopackage by Hilbert curve, and
tweak the internals to build indexes in a flattish manner?

For example, here is a branch of Postgres where the GiST index is built
in a flat manner.
It does not require any change on the user's side, but gets you all the
benefits, simply by building 10x faster.
https://github.com/glukhovn/postgres/tree/gist_btree_build
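As a rough illustration of the sorting half of that convention (my own
sketch, not tied to any particular format): map each feature's bbox center
onto a Hilbert curve over the dataset extent and sort by that value before
writing:

// Hilbert index (xy2d) on a 2^order x 2^order grid; the standard
// iterative algorithm.
function hilbert(x: number, y: number, order: number): number {
  const n = 1 << order;
  let d = 0;
  for (let s = n >> 1; s > 0; s >>= 1) {
    const rx = (x & s) > 0 ? 1 : 0;
    const ry = (y & s) > 0 ? 1 : 0;
    d += s * s * ((3 * rx) ^ ry);
    // Rotate/flip the quadrant so the curve stays contiguous.
    if (ry === 0) {
      if (rx === 1) { x = n - 1 - x; y = n - 1 - y; }
      const t = x; x = y; y = t;
    }
  }
  return d;
}

interface BBox { minX: number; minY: number; maxX: number; maxY: number; }

// Sort features by the Hilbert value of their bbox center, scaled into
// a 2^16 x 2^16 grid over the dataset extent.
function hilbertSort(feats: BBox[], extent: BBox): BBox[] {
  const order = 16;
  const scaleX = ((1 << order) - 1) / (extent.maxX - extent.minX);
  const scaleY = ((1 << order) - 1) / (extent.maxY - extent.minY);
  const key = (f: BBox) =>
    hilbert(
      Math.floor(((f.minX + f.maxX) / 2 - extent.minX) * scaleX),
      Math.floor(((f.minY + f.maxY) / 2 - extent.minY) * scaleY),
      order);
  return feats.slice().sort((a, b) => key(a) - key(b));
}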


On Sun, Feb 24, 2019 at 4:59 PM Björn Harrtell 
wrote:

> I've now implemented it so that the traversal of the R-tree reads nodes on
> demand and, as suspected, it gives a significant boost:
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
> real 0m0,378s (~ 20% improvement)
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
> done
> real 0m0,790s (~ 30% improvement)
>
> I'm now also faster on the reduced window (56 10.1 56.1) beating gpkg:
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
> done
> real 0m0,305s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
> done
> real 0m0,242s
>
> It still remains a TODO to further optimize I/O to read larger blocks if
> possible (applies both to R-tree nodes and the feature data itself) and I
> think that could prove a very significant performance boost potentially
> leaving any competition in the dust. :D
>
> /Björn
>
> On Sat, 9 Feb 2019 at 00:08, Björn Harrtell <
> bjorn.harrt...@gmail.com> wrote:
>
>> I've now implemented spatial filtering (at GDAL fork
>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are
>> promising - for two test cases about 3.5-4 times faster than shapefile with
>> the same content. I've used these command lines to measure total time over
>> 10 iterations for each format and dataset:
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
>> real 0m1,940s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
>> real 0m0,750s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
>> real 0m0,461s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single;
>> done
>> real 0m4,381s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
>> done
>> real 0m1,624s
>>
>> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
>> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
>> done
>> real 0m1,113s
>>
>> The bounds 10 56 10.2 56.2 contain 21525 out of the 812547 total
>> features for gis_osm_roads_free_1 and 57699 out of 2027890 features for
>> gis_osm_buildings_a_free_1_single.
>>
>> And there is definitely room for optimization. The implementation is
>> naive and reads the whole R-tree into memory and does not yet read
>> consecutive features (a common case, as the R-tree is balanced) in larger
>> I/O blocks, which I think can be significant. If I reduce the extent to
>> 10 56 10.1 56.1, GPKG is actually a fair bit faster, which I think is due
>> to it being able to read the index partially.
>>
>> /Björn
>>
>> On Tue, 22 Jan 2019 at 18:43, Björn Harrtell <
>> bjorn.harrt...@gmail.com> wrote:
>>
>>> After posting about my experimental format I realized that I lacked
>>> numbers on its potential performance, so I tried to make some more or less
>>> scientific measurements. The results were disappointing, reaching similar
>>> performance as shapefile for full sequential reads, and so I lost interest
>>> for a while. But recently I found out that my method of measuring was
>>> flawed - I measured the time to read into a new shapefile using ogr2ogr,
>>> but it turns out that can be quite unfair due to the dynamic string field
>>> length resizing that ogr2ogr will do if the strings from input feature
>>> attributes are of variable length. I've now made some new measurements
>>> using ogrinfo and the undocumented flag -qq to get a full sequential read.
>>>
>>> I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to
>>> single poly) and "gis_osm_roads_free_1" from Denmark at
>>> http://download.geofabrik.de/europe.html as the source, converted to
>>> respective format, then measured the time it takes to run ogrinfo -qq on
>>> them. Results (average over three runs):
>>>
>>> ## gis_osm_buildings_a_free_1 (2027890 Polygons)
>>> * SHP: 3800 ms
>>> * GPKG: 2700 ms
>>> * FlatGeobuf: 2200 ms

Re: [gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2019-02-24 Thread Björn Harrtell
I've now implemented it so that the traversal of the R-tree reads nodes on
demand and, as suspected, it gives a significant boost:

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
real 0m0,378s (~ 20% improvement)

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
done
real 0m0,790s (~ 30% improvement)

I'm now also faster on the reduced window (56 10.1 56.1) beating gpkg:

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
done
real 0m0,305s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.1 56.1
gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
done
real 0m0,242s

It still remains a TODO to further optimize I/O to read larger blocks if
possible (applies both to R-tree nodes and the feature data itself) and I
think that could prove a very significant performance boost potentially
leaving any competition in the dust. :D
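To make "reads nodes on demand" concrete, here is a sketch of the traversal
under an assumed packed layout; fetchRange, decodeNodes and the fixed page
size are hypothetical stand-ins, not the actual FlatGeobuf encoding:

interface TreeNode {
  minX: number; minY: number; maxX: number; maxY: number;
  childOffset: number; // byte offset of child node page, or of a feature
  isLeaf: boolean;
}

declare function fetchRange(offset: number, length: number): Promise<Uint8Array>;
declare function decodeNodes(bytes: Uint8Array): TreeNode[];

// Depth-first search that only fetches the node pages whose bounds
// intersect the query, instead of reading the whole index up front.
async function searchIndex(
  rootOffset: number, pageSize: number,
  q: { minX: number; minY: number; maxX: number; maxY: number }
): Promise<number[]> {
  const featureOffsets: number[] = [];
  const stack: number[] = [rootOffset];
  while (stack.length > 0) {
    const nodes = decodeNodes(await fetchRange(stack.pop()!, pageSize));
    for (const n of nodes) {
      if (n.maxX < q.minX || n.minX > q.maxX ||
          n.maxY < q.minY || n.minY > q.maxY) continue;
      if (n.isLeaf) featureOffsets.push(n.childOffset);
      else stack.push(n.childOffset); // fetched on demand later
    }
  }
  return featureOffsets;
}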

/Björn

On Sat, 9 Feb 2019 at 00:08, Björn Harrtell wrote:

> I've now implemented spatial filtering (at GDAL fork
> https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are
> promising - for two test cases about 3.5-4 times faster than shapefile with
> the same content. I've used these command lines to measure total time over
> 10 iterations for each format and dataset:
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
> real 0m1,940s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
> real 0m0,750s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
> real 0m0,461s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single;
> done
> real 0m4,381s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
> done
> real 0m1,624s
>
> > time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
> gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
> done
> real 0m1,113s
>
> The bounds 10 56 10.2 56.2 contain 21525 out of the 812547 total features
> for gis_osm_roads_free_1 and 57699 out of 2027890 features for
> gis_osm_buildings_a_free_1_single.
>
> And there is definitely room for optimization. The implementation is naive
> and reads the whole R-tree into memory and does not yet read consecutive
> features (a common case, as the R-tree is balanced) in larger I/O blocks,
> which I think can be significant. If I reduce the extent to 10 56 10.1 56.1,
> GPKG is actually a fair bit faster, which I think is due to it being able
> to read the index partially.
>
> /Björn
>
> On Tue, 22 Jan 2019 at 18:43, Björn Harrtell <
> bjorn.harrt...@gmail.com> wrote:
>
>> After posting about my experimental format I realized that I lacked
>> numbers on its potential performance, so I tried to make some more or less
>> scientific measurements. The results were disappointing, reaching similar
>> performance as shapefile for full sequential reads, and so I lost interest
>> for a while. But recently I found out that my method of measuring was
>> flawed - I measured the time to read into a new shapefile using ogr2ogr,
>> but it turns out that can be quite unfair due to the dynamic string field
>> length resizing that ogr2ogr will do if the strings from input feature
>> attributes are of variable length. I've now made some new measurements
>> using ogrinfo and the undocumented flag -qq to get a full sequential read.
>>
>> I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to single
>> poly) and "gis_osm_roads_free_1" from Denmark at
>> http://download.geofabrik.de/europe.html as the source, converted to
>> respective format, then measured the time it takes to run ogrinfo -qq on
>> them. Results (average over three runs):
>>
>> ## gis_osm_buildings_a_free_1 (2027890 Polygons)
>> * SHP: 3800 ms
>> * GPKG: 2700 ms
>> * FlatGeobuf: 2200 ms
>>
>> ## gis_osm_roads_free_1 (812547 LineStrings)
>> * SHP: 1600 ms
>> * GPKG: 1350 ms
>> * FlatGeobuf: 800 ms
>>
>> With that my hope and interest is rekindled. I believe I can fare even
>> better against the competition with spatially filtered searches, and will
>> hopefully soon have some results on that too.
>>
>> /Björn
>>
>> On Sun, 9 Dec 2018 at 20:36, Björn Harrtell <
>> bjorn.harrt...@gmail.com> wrote:
>>
>>> Hi GDAL/OGR folks,
>>>
>>> In my spare time I've been working on a vector file format called
>>> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
>>> I'm putting time into it is that I think shapefile still rules the
>>> read/query static data performance game, which is kind of sad, and probably
>>> one of the reasons it is still so widely popular.

Re: [gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2019-02-08 Thread Björn Harrtell
I've now implemented spatial filtering (at GDAL fork
https://github.com/bjornharrtell/gdal/tree/flatgeobuf) and results are
promising - for two test cases about 3.5-4 times faster than shapefile with
the same content. I've used these command lines to measure total time over
10 iterations for each format and dataset:

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_roads_free_1.shp gis_osm_roads_free_1; done
real 0m1,940s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_roads_free_1.gpkg gis_osm_roads_free_1; done
real 0m0,750s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_roads_free_1.fgb gis_osm_roads_free_1; done
real 0m0,461s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_buildings_a_free_1_single.shp gis_osm_buildings_a_free_1_single;
done
real 0m4,381s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_buildings_a_free_1_single.gpkg gis_osm_buildings_a_free_1_single;
done
real 0m1,624s

> time for i in {1..10}; do ogrinfo -qq -spat 10 56 10.2 56.2
gis_osm_buildings_a_free_1_single.fgb gis_osm_buildings_a_free_1_single;
done
real 0m1,113s

The bounds 10 56 10.2 56.2 contain 21525 out of the 812547 total features
for gis_osm_roads_free_1 and 57699 out of 2027890 features for
gis_osm_buildings_a_free_1_single.

And there is definitely room for optimization. The implementation is naive
and reads the whole R-tree into memory and does not yet read consecutive
features (a common case, as the R-tree is balanced) in larger I/O blocks,
which I think can be significant. If I reduce the extent to 10 56 10.1 56.1,
GPKG is actually a fair bit faster, which I think is due to it being able
to read the index partially.

/Björn

On Tue, 22 Jan 2019 at 18:43, Björn Harrtell wrote:

> After posting about my experimental format I realized that I lacked numbers
> on its potential performance, so I tried to make some more or less
> scientific measurements. The results were disappointing, reaching similar
> performance as shapefile for full sequential reads, and so I lost interest
> for a while. But recently I found out that my method of measuring was
> flawed - I measured the time to read into a new shapefile using ogr2ogr,
> but it turns out that can be quite unfair due to the dynamic string field
> length resizing that ogr2ogr will do if the strings from input feature
> attributes are of variable length. I've now made some new measurements
> using ogrinfo and the undocumented flag -qq to get a full sequential read.
>
> I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to single
> poly) and "gis_osm_roads_free_1" from Denmark at
> http://download.geofabrik.de/europe.html as the source, converted to
> respective format, then measured the time it takes to run ogrinfo -qq on
> them. Results (average over three runs):
>
> ## gis_osm_buildings_a_free_1 (2027890 Polygons)
> * SHP: 3800 ms
> * GPKG: 2700 ms
> * FlatGeobuf: 2200 ms
>
> ## gis_osm_roads_free_1 (812547 LineStrings)
> * SHP: 1600 ms
> * GPKG: 1350 ms
> * FlatGeobuf: 800 ms
>
> With that my hope and interest is rekindled. I believe I can fare even
> better against the competition with spatially filtered searches, and will
> hopefully soon have some results on that too.
>
> /Björn
>
> On Sun, 9 Dec 2018 at 20:36, Björn Harrtell <
> bjorn.harrt...@gmail.com> wrote:
>
>> Hi GDAL/OGR folks,
>>
>> In my spare time I've been working on a vector file format called
>> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
>> I'm putting time into it is that I think shapefile still rules the
>> read/query static data performance game, which is kind of sad, and probably
>> one of the reasons it is still so widely popular. Also, the main competitor
>> (GeoPackage) isn't suitable for streaming access (AFAIK), which shapefiles
>> also handle surprisingly (?) well.
>>
>> By using a performance-focused, write-once binary encoding (flatbuffers),
>> schema constraints, and focusing on read/query performance by clustering on
>> an optimal spatial index (Packed Hilbert R-Tree), I hope to be able to beat
>> shapefile performance and at the same time be optimal for streaming/cloud
>> access.
>>
>> I think I'm starting to get somewhere; more details and source are at
>> https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
>> concept driver implementation at
>> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
>> already quite promising - it can do roundtrip read/write and is already
>> quite a bit faster than shapefile. I have also implemented a naive
>> read-only QGIS provider for experimental purposes.
>>
>> Basically I'm fishing for interest in this effort, hoping that others
>> will see potential in it even if it's "yet another format" and far from
>> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
>> place for a format specification and reference implementation to incubate.
>>
>> Some additional food for thought that I've had during the experimentation:

Re: [gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2019-01-22 Thread Björn Harrtell
After posting about my experimental format I realized that I lacked numbers
on its potential performance, so I tried to make some more or less
scientific measurements. The results were disappointing, reaching similar
performance as shapefile for full sequential reads, and so I lost interest
for a while. But recently I found out that my method of measuring was
flawed - I measured the time to read into a new shapefile using ogr2ogr,
but it turns out that can be quite unfair due to the dynamic string field
length resizing that ogr2ogr will do if the strings from input feature
attributes are of variable length. I've now made some new measurements
using ogrinfo and the undocumented flag -qq to get a full sequential read.

I've used the shapefiles "gis_osm_buildings_a_free_1" (exploded to single
poly) and "gis_osm_roads_free_1" from Denmark at
http://download.geofabrik.de/europe.html as the source, converted to
respective format, then measured the time it takes to run ogrinfo -qq on
them. Results (average over three runs):

## gis_osm_buildings_a_free_1 (2027890 Polygons)
* SHP: 3800 ms
* GPKG: 2700 ms
* FlatGeobuf: 2200 ms

## gis_osm_roads_free_1 (812547 LineStrings)
* SHP: 1600 ms
* GPKG: 1350 ms
* FlatGeobuf: 800 ms

With that my hope and interest is rekindled. I believe I can fare even
better against the competition with spatially filtered searches, and will
hopefully soon have some results on that too.

/Björn

On Sun, 9 Dec 2018 at 20:36, Björn Harrtell wrote:

> Hi GDAL/OGR folks,
>
> In my spare time I've been working on a vector file format called
> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
> I'm putting time into it is that I think shapefile still rules the
> read/query static data performance game, which is kind of sad, and probably
> one of the reasons it is still so widely popular. Also, the main competitor
> (GeoPackage) isn't suitable for streaming access (AFAIK), which shapefiles
> also handle surprisingly (?) well.
>
> By using a performance-focused, write-once binary encoding (flatbuffers),
> schema constraints, and focusing on read/query performance by clustering on
> an optimal spatial index (Packed Hilbert R-Tree), I hope to be able to beat
> shapefile performance and at the same time be optimal for streaming/cloud
> access.
>
> I think I'm starting to get somewhere; more details and source are at
> https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
> concept driver implementation at
> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
> already quite promising - it can do roundtrip read/write and is already
> quite a bit faster than shapefile. I have also implemented a naive
> read-only QGIS provider for experimental purposes.
>
> Basically I'm fishing for interest in this effort, hoping that others will
> see potential in it even if it's "yet another format" and far from
> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
> place for a format specification and reference implementation to incubate.
>
> Some additional food for thought that I've had during the experimentation:
>
> 1. The main in-memory representations of geometry/coordinates seem to be
> either separate arrays per dimension (GDAL (partially?) and QGIS) or a
> single array with dimension as stride. I've chosen the latter as of yet but
> I'm still a bit undecided. There is of course a substantial cost involved in
> transforming between the two representations, so the situation with two
> competing models is a bit unfortunate. I'm also not sure which of
> these models is objectively "better" than the other.
>
> 2. One could get better space efficiency with protobuf instead of
> flatbuffers, but it has a performance cost. I have not gone far into
> investigating how much though and one could even reason about supporting
> both these encodings in a single format specification but it's taking it a
> bit far I think.
>
> 3. Is there a reason to support different packing strategies for the
> R-Tree or is Packed Hilbert a good compromise (besides it being beautifully
> simple to implement)?
>
> 4. FlatGeobuf is perhaps a too technical name, not catchy enough and has a
> bad/boring abbreviation. Other candidates I've considered are OpenGeoFile,
> OpenVectorFile or OpenSpatialFile but I'm undecided. Any ideas? :)
>
> Regards,
>
> Björn Harrtell
>

Re: [gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2018-12-10 Thread Björn Harrtell
Hi Benjamin,

Very interesting to read that you have experimented with similar things, and
about your positive experiences with flatbuffers.

On Mon, 10 Dec 2018 at 18:18, Benjamin Stadin <
benjamin.sta...@bridging-it.de> wrote:

> Björn,
>
> Interesting to see there is some progress in this area. We wrote a
> Flatbuffers-based vector tile schema about 2 1/2 years ago, which is used
> in our rendering product (3D map rendering engine). I can share some
> thoughts about the general pros and cons, and what I'd do differently (from
> today's perspective).
>
> Just last week I also started experimenting with using our tile format
> within QGIS, which would work similar to the MVT QGIS plugin [1] which uses
> web meractor tiling. It will parse our "HSG" flatbuffers cells (aka tiles),
> convert them on the fly into GeoJSON and pass them to QGIS.
>
> Some thoughts about your current implementation:
>
> 1) I second Even's question about the actual use-case. I think a
> traditional tile-based approach will be more efficient for bbox querying,
> and client-side visualization. I'll share my approach later below.
>

I agree, and perhaps the spatial index part needs more consideration, but
it's an important point that the format I'm trying to make is not something
to compete with vector tiles and render-optimized representations - my aim
is for the case where you want/need to work with simple features source
data in a lossless representation, i.e. more or less the same case as when
directly using shapefiles.


> 2) 2 1/2 years and quite some projects later, I still prefer Flatbuffers
> over protobuf for a number of reasons. Starting with direct memory access,
> easier client side parser implementation (structs instead of strange
> parallel arrays and such), and performance. But it has one limitation with
> regard to memory usage and file size. Flatbuffers will require more memory
> large flatbuffers file. This is due to the direct memory access
> capabilities. You can create a size-prefixed flatbuffers file (I've asked a
> question here [2]). But this will have negative side effects for your
> r-tree based approach. You'll need to read larger chunks (thus more
> unrelated data) to stream individual features. For streaming bbox queries a
> tile based approach is still more efficient, and simpler. Data in one area
> is packed into one memory region, and can be queried using common tile
> pyramid techniques (or variations like ours).
>

I haven't dived deep enough into this yet, but my hope is that since I
constrain my format to be non-updatable, I can construct and assume I have a
balanced R-tree with a known size and structure, and as such I should not
need to read more of it than needed for traversal into the relevant part of
the tree, and then I can use the offset index to get byte offsets to each
feature out of a tree query result. Because I also require the features to
be written in the order the R-tree has been built, they will be clustered on
R-tree nodes, so that most queries should result in small sets of sequential
reads to get the features.
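A sketch of that last step (my illustration, with maxGap as an arbitrary
tuning knob): because matching features are clustered, sorting the hit
ranges and merging near-adjacent ones collapses most queries into a few
sequential reads:

interface ByteRange { offset: number; length: number; }

// Merge byte ranges that are adjacent or closer than maxGap, so one
// larger sequential read replaces many small ones.
function mergeRanges(ranges: ByteRange[], maxGap: number): ByteRange[] {
  const sorted = ranges.slice().sort((a, b) => a.offset - b.offset);
  const merged: ByteRange[] = [];
  for (const r of sorted) {
    const last = merged[merged.length - 1];
    if (last !== undefined && r.offset <= last.offset + last.length + maxGap) {
      last.length = Math.max(last.length, r.offset + r.length - last.offset);
    } else {
      merged.push({ ...r });
    }
  }
  return merged;
}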


> 3) The R-tree implementation might be sufficient for simple use cases. But
> when later requiring to deal with 3D data and large data sets (like OSM)
> the implementation will become more complex, and would require to handle
> efficient disk based indexing instead of having to read the whole index
> into memory (I think the memory footprint for the index of the OSM data set
> will be in the region of 1GB++). In the end the advantage might not be so
> big compared to e.g. SQLite's R-Tree virtual table. It's actually a quite
> fast implementation, considering its low memory footprint (by cost of disk
> accesses).
>

Agreed that I might have underestimated the complexity of a partially
buffered tree traversal, but consider the above answer.


> 4) Seconding Even's suggestion about using WKB. Our implementation is also
> related to WKB types, but we defined the geometry types as part of the
> schema similar to yours. It works fine, but today I'd choose a byte array
> for WKB data for generic use and easier integration into existing software
> (no need to write a custom geometry parser for any language, WKB readers
> already exist which do that for you).
>

While I mostly agree I'm not entirely convinced; see my answer to Even.


> 5) In case the vector format is intended for client-side rendering, how to
> handle zoom levels properly?
>

I view this as out of scope for the format.


> From an adoptability and maintenance point of view, I'd try everything
> possible to avoid recreating too much. I think an "integrative" approach is
> a better choice. E.g. creating a Geopackage extension for the vector tiles
> format analog to WebP, using WKB as byte array, and referring to the common
> tile pyramid scheme for easier implementation and integration possibilities
> for client sider rendering.
>

While that sounds like an 

Re: [gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2018-12-10 Thread Björn Harrtell
Thanks Even, answers inlined.

On Mon, 10 Dec 2018 at 13:19, Even Rouault wrote:

> Björn,
>
> > In my spare time I've been working on a vector file format called
> > FlatGeobuf (tentatively). The main reason, besides curiosity and
> learning,
> > I'm putting time into it is that I think shapefile still rules the
> > read/query static data performance game, which is kind of sad, and
> probably
> > one of the reasons it is still so widely popular. Also, the main
> competitor
> > (GeoPackage) isn't suitable for streaming access (AFAIK)
>
> I suspect that you could organize the layout of a sqlite3 file to be
> streaming
> friendly, but the sqlite3 lib itself is probably not ready for that (or
> you
> have to cheat by buffering a sufficiently large number of bytes in memory
> and
> use a custom VFS to read in that buffer. at that point, implementing your
> own
> dedicated sqlite3 reader might be better. likely doable, but not trivial).
> Random access is possible (you can use /vsicurl/ etc on a geopackage), but
> it
> might involve a number of seeks and small reads in the B-Tree / R-Tree.


> That raises a major question: which use case(s) do you want to address
> specifically? From what I've understood, for network access:
> - streaming access for progressive display as bytes come in
> - efficient bbox queries with minimized number of bytes and ranges in the
> files to read (minimizing the number of ranges of bytes is probably the
> most
> important criterion since reading 1 byte or 1000 will take the same time)
>

The use case I had in mind is not the lossy, render-optimized one; for
that use case I think vector tiles are a good design. What I'm aiming for
is essentially something to replace the shapefile, i.e. a lossless container
of simple features with as good as possible performance for bbox query
reads or full dataset reads via network or other means of I/O, without
having to deal with preprocessing into tiles, generalization etc.


> > I think I'm starting to get somewhere; more details and source are at
> > https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
> > concept driver implementation at
> > https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
> > already quite promising - it can do roundtrip read/write and is already
> > quite a bit faster than shapefile. I have also implemented a naive read-only
> > QGIS provider for experimental purposes.
>
> What are the I and O in the I+O related to the R-Tree index ?
>

I symbolizes the R-Tree nodes. O is a separate array with feature
offsets (the byte offset of each feature), so that together you can quickly
get the byte ranges that need to be fetched for a spatial query.


> Wondering if a 3D index could be an option in case you want to address the
> full 3D case at some point. But might be something for later.
>
> I'm not familiar with flatbuf, but is random access in the Feature table
> by feature index possible (without a preliminary scan pass), similarly to
> a .shx file in a shapefile ?
>

That is the purpose of the O as explained above. Each feature is a separate
flatbuffer message and can be accessed directly.
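So random access by feature index reduces to one lookup plus one read,
roughly like this (a sketch with hypothetical helpers; the offsets would
come from the O array):

declare function fetchRange(offset: number, length: number): Promise<Uint8Array>;
declare function decodeFeature(bytes: Uint8Array): unknown;

// offsets[i] is the byte offset of feature i; its length is the
// distance to the next offset (or to the end of the file for the last).
async function readFeature(
  offsets: number[], fileSize: number, i: number
): Promise<unknown> {
  const start = offsets[i];
  const end = i + 1 < offsets.length ? offsets[i + 1] : fileSize;
  return decodeFeature(await fetchRange(start, end - start));
}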


> Just a detail: it could be nice to have some global flag in the header
> that
> would mean "index of any feature = its FID, its FID - 1, or no particular
> correlation between feature index and feature FID"
>

Not sure exactly what you mean, but I've considered having an optional FID
index to support fast random access by FID in cases where the FID is not
the same as the index of the feature, and I guess what you are saying is
that this should be explicit. This is not yet added to the spec.


> For completeness of the attribute types, you could have date, time and
> binary.
> Can the concept of null value or empty value or both for a field value be
> encoded ?
>

Yes, perhaps it would be useful to have dedicated types for date, time and
binary. I recently added datetime.

The columns/field definition is static for the layer. Values are required
to specify a column index. Null/missing values are represented by simply
omitting values for column indexes. An empty values array for a feature =
all values are null.
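Decoding could thus look roughly like this (a sketch of the convention only,
not the actual FlatGeobuf wire layout):

interface Column { name: string; }
interface Value { columnIndex: number; value: string | number | boolean; }

// Every column starts out null; only the (columnIndex, value) pairs
// actually present in the feature overwrite that default.
function decodeValues(
  columns: Column[], values: Value[]
): Record<string, string | number | boolean | null> {
  const row: Record<string, string | number | boolean | null> = {};
  for (const c of columns) row[c.name] = null;
  for (const v of values) row[columns[v.columnIndex].name] = v.value;
  return row;
}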

> The Column structure could also receive more optional info: a long
> description, maximum length for strings, optional precision/scale
> formatting for those nostalgic for decimal formatting
>

Agreed.


> If you want full genericity to express an SRS, allowing a WKT CRS string as
> an alternative to authority+code.
>

Agreed, I should consider it.


> >
> > Basically I'm fishing for interest in this effort, hoping that others
> will
> > see potential in it even if it's "yet another format" and far from
> > final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
> > place for a format specification and reference implementation to
> incubate.
> >
> > Some additional food for thought that I've had during the
> experimentation:
> >
> > 1. The main in memory representations 

Re: [gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2018-12-10 Thread Even Rouault
Björn,

> In my spare time I've been working on a vector file format called
> FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
> I'm putting time into it is that I think shapefile still rules the
> read/query static data performance game, which is kind of sad, and probably
> one of the reasons it is still so widely popular. Also, the main competitor
> (GeoPackage) isn't suitable for streaming access (AFAIK)

I suspect that you could organize the layout of a sqlite3 file to be streaming 
friendly, but the sqlite3 lib itself is probably not ready for that (or you 
have to cheat by buffering a sufficiently large number of bytes in memory and 
use a custom VFS to read in that buffer. at that point, implementing your own 
dedicated sqlite3 reader might be better. likely doable, but not trivial). 
Random access is possible (you can use /vsicurl/ etc on a geopackage), but it 
might involve a number of seeks and small reads in the B-Tree / R-Tree.

That raises a major question: which use case(s) do you want to address
specifically? From what I've understood, for network access:
- streaming access for progressive display as bytes come in
- efficient bbox queries with minimized number of bytes and ranges in the 
files to read (minimizing the number of ranges of bytes is probably the most 
important criterion since reading 1 byte or 1000 will take the same time)

> I think I'm starting to get somewhere, more details and source is at
> https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
> concept driver implementation at
> https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
> already quite promising - it can do roundtrip read/write and is already
> quite a bit faster than shapefile. I also have implemented naive read only
> QGIS provider for experimental purposes.

What are the I and O in the I+O related to the R-Tree index ?

Wondering if a 3D index could be an option in case you want to address the 
full 3D case at some point. But might be something for later.

I'm not familiar with flatbuf, but is random access in the Feature table by
feature index possible (without a preliminary scan pass), similarly to a
.shx file in a shapefile ?

Just a detail: it could be nice to have some global flag in the header that 
would mean "index of any feature = its FID, its FID - 1, or no particular 
correlation between feature index and feature FID"

For completeness of the attribute types, you could have date, time and binary.
Can the concept of null value or empty value or both for a field value be 
encoded ?

The Column structure could also receive more optional info: a long
description, maximum length for strings, optional precision/scale formatting
for those nostalgic for decimal formatting

If you want full genericity to express an SRS, allowing a WKT CRS string as an
alternative to authority+code.

> 
> Basically I'm fishing for interest in this effort, hoping that others will
> see potential in it even if it's "yet another format" and far from
> final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
> place for a format specification and reference implementation to incubate.
> 
> Some additional food for thought that I've had during the experimentation:
> 
> 1. The main in-memory representations of geometry/coordinates seem to be
> either separate arrays per dimension (GDAL (partially?) and QGIS) or a
> single array with dimension as stride. I've chosen the latter as of yet but
> I'm still a bit undecided. There is of course a substantial cost involved in
> transforming between the two representations, so the situation with two
> competing models is a bit unfortunate. I'm also not sure which of
> these models is objectively "better" than the other.

Why not just use WKB encoding since it has likely similar size and
performance characteristics to the flatbuf encoding, with the main advantage 
of being widely implemented and supporting other geometry types like 
CircularString, PolyhedralSurface, etc..., which you need if you want to fully 
compete with GeoPackage ?
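For illustration, WKB is also trivial to produce for the basic types; a 2D
point is just a byte-order flag, a uint32 type code and two doubles:

// Encode a 2D point as little-endian WKB (21 bytes):
// 1 byte order flag + 4 byte type code + 2 x 8 byte doubles.
function wkbPoint(x: number, y: number): Uint8Array {
  const buf = new ArrayBuffer(21);
  const view = new DataView(buf);
  view.setUint8(0, 1);         // 1 = little-endian
  view.setUint32(1, 1, true);  // geometry type 1 = Point
  view.setFloat64(5, x, true);
  view.setFloat64(13, y, true);
  return new Uint8Array(buf);
}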

> 
> 2. One could get better space efficiency with protobuf instead of
> flatbuffers, but it has a performance cost. I have not gone far into
> investigating how much though and one could even reason about supporting
> both these encodings in a single format specification but it's taking it a
> bit far I think.

This contradicts my above point a bit, but if you decide on a custom geometry
encoding, why not allow int32 for vertex values, with an offset+scale
setting? (ala OpenStreetMap PBF)
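A sketch of that offset+scale idea (my illustration; the scale value is
arbitrary):

// Quantize coordinates to int32 around an origin; scale = 1e7 gives
// roughly 1e-7 degree resolution, similar in spirit to OSM PBF.
function quantize(coords: number[], origin: number, scale: number): Int32Array {
  const out = new Int32Array(coords.length);
  for (let i = 0; i < coords.length; i++) {
    out[i] = Math.round((coords[i] - origin) * scale);
  }
  return out;
}

function dequantize(q: Int32Array, origin: number, scale: number): number[] {
  return Array.from(q, (v) => v / scale + origin);
}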

If the main use case you want to address is cloud use, I was wondering if it
would make sense to add a compression layer (ZSTD-compress each Feature? or
a group of features) to reduce the time spent waiting for data from the
network. Or perhaps not, and just let the transport layer do the compression
(HTTP Encoding: gzip)
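For what it's worth, a sketch of the per-group variant using Node's built-in
zlib (gzip as a stand-in; ZSTD would need an external binding):

import { gzipSync, gunzipSync } from 'node:zlib';

// Compress a group of encoded features as one block, so decompression
// overhead is amortized over the group. Feature boundaries would have
// to be stored separately (e.g. an array of lengths).
function compressGroup(features: Uint8Array[]): Buffer {
  const parts = features.map((f) => Buffer.from(f));
  return gzipSync(Buffer.concat(parts));
}

function decompressGroup(block: Buffer): Buffer {
  return gunzipSync(block); // concatenated feature bytes
}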

> 4. FlatGeobuf is perhaps a too technical name, not catchy enough, and has a
> bad/boring abbreviation.

[gdal-dev] FlatGeobuf; proposal for a new performance oriented vector file format

2018-12-09 Thread Björn Harrtell
Hi GDAL/OGR folks,

In my spare time I've been working on a vector file format called
FlatGeobuf (tentatively). The main reason, besides curiosity and learning,
I'm putting time into it is that I think shapefile still rules the
read/query static data performance game, which is kind of sad, and probably
one of the reasons it is still so widely popular. Also, the main competitor
(GeoPackage) isn't suitable for streaming access (AFAIK), which shapefiles
also handle surprisingly (?) well.

By using a performance-focused, write-once binary encoding (flatbuffers),
schema constraints, and focusing on read/query performance by clustering on
an optimal spatial index (Packed Hilbert R-Tree), I hope to be able to beat
shapefile performance and at the same time be optimal for streaming/cloud
access.

I think I'm starting to get somewhere; more details and source are at
https://github.com/bjornharrtell/flatgeobuf and I have an early proof of
concept driver implementation at
https://github.com/bjornharrtell/gdal/tree/flatgeobuf and results are
already quite promising - it can do roundtrip read/write and is already
quite a bit faster than shapefile. I have also implemented a naive read-only
QGIS provider for experimental purposes.

Basically I'm fishing for interest in this effort, hoping that others will
see potential in it even if it's "yet another format" and far from
final/stable yet. Any feedback is welcome. As I see it, GDAL is a good
place for a format specification and reference implementation to incubate.

Some additional food for thought that I've had during the experimentation:

1. The main in-memory representations of geometry/coordinates seem to be
either separate arrays per dimension (GDAL (partially?) and QGIS) or a
single array with dimension as stride. I've chosen the latter as of yet but
I'm still a bit undecided. There is of course a substantial cost involved in
transforming between the two representations, so the situation with two
competing models is a bit unfortunate. I'm also not sure which of these
models is objectively "better" than the other (a sketch of the two layouts
follows after these points).

2. One could get better space efficiency with protobuf instead of
flatbuffers, but it has a performance cost. I have not gone far into
investigating how much though and one could even reason about supporting
both these encodings in a single format specification but it's taking it a
bit far I think.

3. Is there a reason to support different packing strategies for the R-Tree
or is Packed Hilbert a good compromise (besides it being beautifully simple
to implement)?

4. FlatGeobuf is perhaps a too technical name, not catchy enough and has a
bad/boring abbreviation. Other candidates I've considered are OpenGeoFile,
OpenVectorFile or OpenSpatialFile but I'm undecided. Any ideas? :)
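Returning to point 1 above, a sketch of the two layouts and the transform
between them (my own illustration, 2D for brevity):

// Interleaved: [x0, y0, x1, y1, ...] with the dimension as stride.
// Separate:    xs = [x0, x1, ...], ys = [y0, y1, ...]
function interleave(xs: number[], ys: number[]): number[] {
  const out = new Array<number>(xs.length * 2);
  for (let i = 0; i < xs.length; i++) {
    out[i * 2] = xs[i];
    out[i * 2 + 1] = ys[i];
  }
  return out;
}

function deinterleave(flat: number[]): { xs: number[]; ys: number[] } {
  const xs: number[] = [];
  const ys: number[] = [];
  for (let i = 0; i < flat.length; i += 2) {
    xs.push(flat[i]);
    ys.push(flat[i + 1]);
  }
  return { xs, ys };
}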

Regards,

Björn Harrtell