Re: [OSM-dev] Simpler binary OSM formats

Andrew Byrd Mon, 08 Feb 2016 03:49:11 -0800

Hello Benjamin,

I was aware of Cap’n Proto, but thanks for pointing out FlatBuffers. I’ve 
studied this system and considered how it might be useful for OSM data 
exchange. Here are my impressions:

1. Each FlatBuffers message does indirection through a table “to allow for 
format evolution and optional fields”. The basic OSM data model is quite stable 
at this point and to my knowledge evolves only through the introduction of 
different tag strings. Unlike existing formats, I’d like vex to be extremely 
simple and non-extensible so developers can easily and completely support 
reading or writing it. I would hesitate to devote space in every serialized 
entity to unused extensibility features. 

2. FlatBuffers messages use fixed-width integers throughout, for both field 
values and offsets. OSM entity IDs are now 64 bits wide. Vtable entries are 16 
bits wide, and 32-bit offsets are used to refer to all strings and vectors, 
which are “never stored in-line”. The buffer will contain a very large proportion of 
zeros and repeated or unnecessary bytes (redundant fragments of coordinates and 
successive OSM entity references, offsets to strings and vectors). To get even 
remotely close to the file sizes we are accustomed to, the FlatBuffers would 
need to be inside compressed blocks. To achieve truly comparable file sizes, 
we’d also want to delta-code most numeric fields and probably apply 
variable-byte coding, i.e. pre-filter the data to assist the general-purpose 
compression in its job. However, FlatBuffers inherently does not support 
variable-width integers.
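
To make the pre-filtering idea concrete, here is a minimal Python sketch of 
delta coding followed by variable-byte coding. It is an illustration only: the 
function names and the sample ID range are made up for this example, and the 
byte layout is not taken from the vex specification or from PBF.

```python
import struct


def zigzag(n: int) -> int:
    """Map a signed 64-bit delta to an unsigned value so small magnitudes stay small."""
    return (n << 1) ^ (n >> 63)


def write_varint(value: int, out: bytearray) -> None:
    """Append an unsigned integer as 7-bit groups, low-order group first."""
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)  # high bit set: more bytes follow
        value >>= 7
    out.append(value)


def encode_ids(ids) -> bytes:
    """Delta-code a sequence of IDs, then varint-code each delta."""
    out = bytearray()
    prev = 0
    for i in ids:
        write_varint(zigzag(i - prev), out)
        prev = i
    return bytes(out)


# Hypothetical run of 1000 consecutive node IDs (values chosen for illustration).
ids = list(range(4_000_000_000, 4_000_001_000))

fixed = struct.pack(f"<{len(ids)}q", *ids)  # fixed-width 64-bit integers
packed = encode_ids(ids)                    # delta + variable-byte coding

print(len(fixed))   # 8000 bytes
print(len(packed))  # 1004 bytes: 5 for the first ID, then 1 per delta
```

Real data is less regular than this run of consecutive IDs, so the ratio will 
be less dramatic, but the sketch shows why fixed-width fields leave so much 
work to the general-purpose compressor.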

3. Generally speaking, I can certainly see the appeal of using code generated 
from a schema to support a format quickly and reliably in several languages. 
But one of the main difficulties I encountered with OSM PBF is that it requires 
the developer to mix automatically generated Protobuf code with various bits of 
hand-rolled code to handle the block structure, compression, delta coding, 
string tables, etc., which diminishes the appeal of code generation. In a 
well-designed format, the code to parse each individual OSM entity (or interpret 
it in-place) could in fact be quite simple compared to this compression and 
block-handling code, and I’m not sure we gain much by generating it. To achieve 
reasonably compact file sizes, FlatBuffers would still require mixing custom 
code into and around generated code. This would defeat one of my major design 
goals.

4. FlatBuffers allows accessing buffer contents without parsing or dynamic 
allocations, which is a laudable goal. However, the vex format as it is 
currently defined would also allow iterative access to every entity with no 
dynamic allocations, requiring only an initial pass over each entity to 
determine the offsets of tags, references, etc. before use. You could refer to 
this as “parsing the entity” but I expect it would have a near zero impact on 
speed (and potentially zero impact considering that the data needs to be pulled 
into the processor cache for use anyway). Also, the file sizes we are 
accustomed to depend on delta coding, which is a cumulative process. While 
entire blocks may be skipped over, we must scan over all entities within a 
block to progressively decode coordinates or entity references. Random access 
within a block is not compatible with delta coding, nor do I see much use for 
it in a bulk data transfer and archiving format. So I think it’s a non-problem 
that we have to sequentially interpret the entities within each block.
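
As a rough illustration of this access pattern, the Python sketch below skips 
whole blocks by their length header but decodes the deltas inside a selected 
block strictly in order. The block layout (one type byte, a 4-byte length, a 
zlib-compressed payload of varint deltas) and all names are assumptions made 
for this example; this is not the actual vex layout.

```python
import struct
import zlib


def read_varint(buf: bytes, pos: int):
    """Read one unsigned varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if byte < 0x80:
            return result, pos
        shift += 7


def write_block(kind: bytes, ids) -> bytes:
    """Build one hypothetical block: type byte, payload length, compressed deltas."""
    payload = bytearray()
    prev = 0
    for i in ids:
        delta = i - prev
        z = (delta << 1) ^ (delta >> 63)  # zigzag: signed delta to unsigned
        while z >= 0x80:
            payload.append((z & 0x7F) | 0x80)
            z >>= 7
        payload.append(z)
        prev = i
    body = zlib.compress(bytes(payload))
    return kind + struct.pack("<I", len(body)) + body


def iter_ids(data: bytes, wanted: bytes):
    """Yield IDs from blocks of the wanted type, skipping all other blocks."""
    pos = 0
    while pos < len(data):
        kind = data[pos:pos + 1]
        (length,) = struct.unpack_from("<I", data, pos + 1)
        start = pos + 5
        pos = start + length
        if kind != wanted:
            continue  # whole block skipped without decompressing it
        payload = zlib.decompress(data[start:pos])
        p = 0
        current = 0
        while p < len(payload):
            z, p = read_varint(payload, p)
            current += (z >> 1) ^ -(z & 1)  # undo zigzag, accumulate the delta
            yield current


data = write_block(b"N", [100, 101, 105]) + write_block(b"W", [7, 9, 8])
print(list(iter_ids(data, b"W")))  # [7, 9, 8]
```

Each value depends on the running total of all deltas before it in the block, 
which is why there is no way to jump to the middle of a block, while jumping 
over an entire block costs only a header read.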

Of course I may have misunderstood something about your suggestion or the use 
cases you had in mind. As always I’d welcome any reactions or discussion. My 
intent here is not to defend a specification set in stone, but to see if there 
is a technical consensus on what a next generation OSM format could look like.

Regards,
Andrew

> On 06 Feb 2016, at 23:47, Stadin, Benjamin 
> <[email protected]> wrote:
> 
> Hi Andrew,
> 
> Cap'n Proto (successor of ProtoBuffer from the guy who invented ProtoBuffer) 
> and FlatBuffers (another ProtoBuffer successor, by Google) have gained a lot 
> of traction since last year. Both eliminate many of the shortcomings of the 
> original ProtoBuffer (they allow for random access, streaming, ...), and also 
> improve on performance.
> 
> https://github.com/google/flatbuffers
> 
> Here is a comparison between ProtoBuffer competitors:
> https://capnproto.org/news/2014-06-17-capnproto-flatbuffers-sbe.html
> 
> In my opinion FlatBuffers is the most interesting. It seems to have very good 
> language and platform support, and has quite a high adoption rate already. 
> 
> I think it is well worth reconsidering whether to create your own file format 
> and parser, for several reasons. Your concept looks well thought out, and it 
> should be possible to implement a lightweight parser using FlatBuffers for 
> your data schema. 
> 
> Regards
> Ben 
> 
> Sent from my iPad
> 
> On 06.02.2016 at 22:37, Andrew Byrd <[email protected]> wrote:
> 
>> Hello OSM developers,
>> 
>> Last spring I posted an article discussing some shortcomings of the PBF 
>> format and proposing a simpler binary OSM interchange format called VEX. 
>> There was a generally positive response at the time, including helpful 
>> feedback from other developers. Since then I have revised the VEX 
>> specification as well as our implementation, and Conveyal has been using 
>> this format in our own day-to-day work.
>> 
>> I have written a new article describing the revised format:
>> http://conveyal.com/blog/2016/02/06/vex-format-part-two
>> 
>> The main differences are 1) it is more regular and even simpler to parse; 
>> and 2) file blocks are compressed individually, allowing parallel processing 
>> and seeking to specific entity types. It is no longer smaller than PBF, but 
>> still comparable in size.
>> 
>> Again, I would welcome any comments you may have on the revised format and 
>> the potential for a shift to simpler binary OSM formats.
>> 
>> Regards,
>> Andrew Byrd
>> 
>> 
>>> On 29 Apr 2015, at 01:35, andrew byrd <[email protected]> wrote:
>>> 
>>> Hello OSM developers,
>>>  
>>> Over the last few years I have worked on several pieces of software that 
>>> consume and produce the PBF format. I have always appreciated the 
>>> advantages of PBF over XML for our use cases, but over time it became 
>>> apparent to me that PBF is significantly more complex than would be 
>>> necessary to meet its objectives of speed and compactness.
>>>  
>>> Based on my observations about the effectiveness of various techniques used 
>>> in PBF and other formats, I devised an alternative OSM representation that 
>>> is consistently about 8% smaller than PBF but substantially simpler to 
>>> encode and decode. This work is presented in an article at 
>>> http://conveyal.com/blog/2015/04/27/osm-formats/. I welcome any comments 
>>> you may have on this article or on the potential for a shift to simpler 
>>> binary OSM formats.
>>>  
>>> Regards,
>>> Andrew Byrd
>>> _______________________________________________
>>> dev mailing list
>>> [email protected]
>>> https://lists.openstreetmap.org/listinfo/dev
>> 

