On Wed, Apr 29, 2015 at 01:35:29AM +0200, andrew byrd wrote:
> Over the last few years I have worked on several pieces of software that
> consume and produce the PBF format. I have always appreciated the
> advantages of PBF over XML for our use cases, but over time it became
> apparent to me that PBF is significantly more complex than would be
> necessary to meet its objectives of speed and compactness.
>
> Based on my observations about the effectiveness of various techniques
> used in PBF and other formats, I devised an alternative OSM
> representation that is consistently about 8% smaller than PBF but
> substantially simpler to encode and decode. This work is presented in an
> article at http://conveyal.com/blog/2015/04/27/osm-formats/. I welcome
> any comments you may have on this article or on the potential for a
> shift to simpler binary OSM formats.
I agree that the PBF format is rather complex. But it has some nice properties we shouldn't forget. First and foremost that is the block structure, which allows generating and parsing in multiple threads. I think that's an important optimization going forward. Not that this would be difficult to add to your format: adding a length field before the blocks you propose and compressing each one on its own would more or less do it.

I also think it is important to have some kind of header for storing file metadata in a flexible way; PBF has that.

Looking at your proposal, you seem to be very concerned with file size but not so much with read/write speed. In my experience, reading and writing PBF is always CPU-bound. Removing complexity could speed this up considerably. But if the price is that we need zlib (de)compression, it might not be worth it, because that is rather CPU- and memory-intensive. Currently you can save quite a lot of CPU time if you do not compress the PBF blocks but leave them uncompressed. Of course the file size goes up, but if you have the storage space, that doesn't matter that much.

Another issue we have to keep in mind is memory usage. The usual compression algorithms work better if they are run on larger pieces of data, but that means you need a lot of memory for the original data and the compressed data at the same time. This might not matter in many cases, but if you are reading and writing lots of files at the same time and/or need your memory for other things, too (which is usually the case), this can become important. The PBF format, for that matter, is pretty problematic in that regard, because of the string table and because of the inefficient way the Google Protobuf library deals with memory management.

About the stats in your blog post comparing the different formats: First, I'd like to see the numbers for the whole planet. A size difference between small extracts doesn't really matter all that much, because the absolute size is so small. Savings on the whole planet file would be much more interesting. Second, the XML and PBF formats usually contain the metadata that you removed in your VEX format. Have you accounted for that in your numbers, i.e. did you remove the metadata from XML and PBF, too? I think the numbers including the metadata would be much more interesting. It is, of course, okay to remove that data for your internal use if you don't need it. But if we are talking about a possible future standard for OSM data that's used in many places, we need at least the option of having that data.

Incidentally, I came up with a similar text format as you did. It is documented here:
http://osmcode.org/libosmium/manual/libosmium-manual.html#opl-object-per-line-format

Jochen
--
Jochen Topf  [email protected]  http://www.jochentopf.com/  +49-351-31778688

_______________________________________________
dev mailing list
[email protected]
https://lists.openstreetmap.org/listinfo/dev
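
To make the block framing described above concrete, here is a minimal sketch in Python. It is illustrative only, not code from any existing OSM tool; the names (write_blocks, read_blocks) and the 1+4 byte block header are assumptions. Each block carries a flag saying whether its payload is zlib-compressed or stored raw, plus its length, so a reader can slice the file into blocks cheaply and decompress them in several workers, and a writer that is short on CPU can skip compression entirely:

import struct
import zlib
from concurrent.futures import ProcessPoolExecutor

RAW, ZLIB = 0, 1  # hypothetical block types: payload stored as-is or zlib-compressed

def write_blocks(path, blocks, compress=True):
    # Each block is framed as <1-byte type><4-byte big-endian length><payload>.
    with open(path, "wb") as f:
        for block in blocks:
            payload = zlib.compress(block) if compress else block
            f.write(struct.pack(">BI", ZLIB if compress else RAW, len(payload)))
            f.write(payload)

def _decode(entry):
    # Decompress a single block if needed; runs in a worker process.
    block_type, payload = entry
    return zlib.decompress(payload) if block_type == ZLIB else payload

def read_blocks(path, workers=4):
    # Scan only the fixed-size headers, then decode the payloads in parallel.
    entries = []
    with open(path, "rb") as f:
        while True:
            header = f.read(5)
            if len(header) < 5:
                break
            block_type, length = struct.unpack(">BI", header)
            entries.append((block_type, f.read(length)))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_decode, entries))

if __name__ == "__main__":
    blocks = [b"node 1 ..." * 1000, b"way 2 ..." * 1000]  # stand-ins for encoded OSM blocks
    write_blocks("blocks.bin", blocks)
    assert read_blocks("blocks.bin") == blocks

In this sketch each worker only ever holds one block and its compressed form at a time, which also touches on the memory-usage point: per-block compression bounds memory by the block size rather than by the whole file.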

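The file-header point can be sketched in the same style. This is only an assumed layout, not the PBF HeaderBlock: a single length-prefixed metadata record (JSON here purely to keep the sketch short) written before the first data block gives a place for writer name, required features, timestamps and so on, and readers can ignore keys they do not know:

import json
import struct

def write_header(f, metadata):
    # A length-prefixed metadata record written before the first data block.
    payload = json.dumps(metadata).encode("utf-8")
    f.write(struct.pack(">I", len(payload)))
    f.write(payload)

def read_header(f):
    (length,) = struct.unpack(">I", f.read(4))
    return json.loads(f.read(length).decode("utf-8"))

if __name__ == "__main__":
    with open("example.bin", "wb") as f:
        write_header(f, {"writer": "example-tool",
                         "required_features": ["DenseNodes"],
                         "timestamp": "2015-04-29T00:00:00Z"})
        # ... data blocks as framed above would follow here ...
    with open("example.bin", "rb") as f:
        print(read_header(f))

A reader that only cares about the data blocks can read the length field and seek past the record, which is what keeps such a header cheap to extend with new metadata later.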
