Hello! I would like to announce code implementing a binary OSM format that supports the full semantics of the OSM XML. It is 5x-10x faster to read and write and 30-50% smaller; an entire planet, including all metadata, can be read in about 12 minutes and written in about 50 minutes on a 3-year-old dual-core machine. I have implemented an osmosis reader and writer and have enhancements to the map splitter to read the format. The code is pure Java and uses Google protocol buffers for the low-level serialization.
Comparing the file sizes:

  8.2gb  planet-100303.osm.bz2
  12gb   planet-100303.osm.gz
  5.2gb  planet-omitmeta.bin
  6.2gb  planet.bin

The omitmeta version omits the uid/user/version/timestamp metadata fields on each entity and is faster to generate and read.

The design is very extensible. The low-level file format is designed to support random access at the 'fileblock' granularity, where a fileblock can contain ~8k OSM entities. There is *no* tag hardcoding used; all keys and values are stored in full as opaque strings. For future scalability, 64-bit node/way/relation IDs are assumed. The current serializer preserves the order of OSM entities and of the tags on OSM entities.

To flexibly handle multiple resolutions, the granularity, or resolution used for representing locations and timestamps, is adjustable in multiples of 1 millisecond and 1 nanodegree and can be set independently for each fileblock. The default scaling factors are 1000 milliseconds and 100 nanodegrees, the latter corresponding to about ~1cm at the equator. These are the current resolutions of the OSM database. Smaller files can be generated: at 10 microdegree granularity, corresponding to about 1m of resolution, the file size decreases by about 1gb. Space may also be saved by removing uninteresting UUID tags or perhaps by building the file with stronger geographic locality.

I have also tested the binary format on some SRTM contour lines in OSM 0.5 XML format, obtaining about a 50:1 compression ratio. This might be further improved by choosing a granularity equal to the isohypsis grid size.

// Testing

I have tested this code on the Cloudmade extract of Rhode Island. After converting the entire file to and from binary format, the XML output is bytewise identical to the original file except for the one line indicating the osmosis version number.
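To make the granularity scaling concrete, here is a minimal sketch of how a coordinate in degrees maps to the stored integer; the class and method names are illustrative, not taken from the actual serializer:

```java
// Sketch: converting degrees to/from the stored integer under an
// adjustable granularity given in nanodegrees (default 100 = ~1cm).
public class GranularityDemo {
    static long encodeDegrees(double degrees, long granularity) {
        long nanodegrees = Math.round(degrees * 1e9);
        return nanodegrees / granularity;
    }

    static double decodeDegrees(long stored, long granularity) {
        return (stored * granularity) / 1e9;
    }

    public static void main(String[] args) {
        long g = 100; // default granularity: 100 nanodegrees
        long stored = encodeDegrees(41.8239891, g);
        System.out.println(stored);                   // 418239891
        System.out.println(decodeDegrees(stored, g)); // 41.8239891
    }
}
```

A coarser granularity (e.g., 10000 nanodegrees for ~1m resolution) simply makes the stored integers smaller, which is what shrinks the varint-encoded file.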
When run through the splitter, the output is not bytewise identical to before because of round-off errors 16 digits after the decimal point; this could be fixed by having the splitter behave like osmosis and only output 7 significant digits.

// To use:

Demonstration code is available on github at

  http://github.com/scrosby/OSM-Osmosis
  http://github.com/scrosby/OSM-splitter

See the 'master' branches. Please note that this is at present unpackaged demonstration code and the file format may change to incorporate suggestions. Also note that the shared code between the splitter and osmosis currently lives in the osmosis git repository. You'll also need to go into the crosby.binary directory and run the protocol compiler ('protoc') on the .proto files (see the comments in those files for the command line).

/// The design ///

I use Google protocol buffers for the low-level store. Given a specification file of one or more messages, the protocol buffer compiler writes low-level serialization code. Messages may contain other messages, forming hierarchical structures. Protocol buffers also support extensibility; new fields can be added to a message and old clients can read those messages without recompiling. For more details, please see http://code.google.com/p/protobuf/. Google officially supports C++, Java, and Python, but compilers exist for other languages.

An example message specification is:

  message Node {
     required sint64 id = 1;
     required sint64 lat = 7;
     required sint64 lon = 8;
     repeated uint32 keys = 9 [packed = true];  // Denote strings
     repeated uint32 vals = 10 [packed = true]; // Denote strings
     optional Info info = 11;                   // Contains metadata
  }

Protocol buffers use a variable-bit encoding for integers. An integer is encoded at 7 bits per byte, where the high bit indicates whether or not the next byte is to be read. When messages contain small integers, the file size is minimized. Two encodings exist, one intended for mostly-positive integers and one for signed integers.
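The two integer encodings can be sketched in a few lines of Java, following the published protocol buffer wire-format rules (this is not the generated protobuf code; names are illustrative):

```java
// Sketch of protobuf's varint and ZigZag encodings.
import java.io.ByteArrayOutputStream;

public class VarintDemo {
    // Standard varint: 7 bits per byte, high bit set while more bytes follow.
    static byte[] encodeVarint(long n) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((n & ~0x7FL) != 0) {
            out.write((int) ((n & 0x7F) | 0x80));
            n >>>= 7;
        }
        out.write((int) n);
        return out.toByteArray();
    }

    // ZigZag: move the sign bit into the least significant position so
    // that small negative numbers also become small unsigned values.
    static long zigzag(long n) {
        return (n << 1) ^ (n >> 63);
    }

    public static void main(String[] args) {
        System.out.println(encodeVarint(127).length);          // 1 byte
        System.out.println(encodeVarint(128).length);          // 2 bytes
        System.out.println(encodeVarint(zigzag(-64)).length);  // 1 byte
        System.out.println(encodeVarint(zigzag(-65)).length);  // 2 bytes
    }
}
```

This is why the format works so hard to produce small integers: every value that fits in 7 bits costs a single byte on disk.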
In the standard encoding, integers in [0,127] require one byte, [128,16383] require two bytes, and so on. In the signed-number encoding, the sign bit is placed in the least significant position; numbers in [-64,63] require one byte, [-8192,8191] require two bytes, and so forth.

To encode OSM entities into protocol buffers, I collect 8k entities to form a 'fileblock'. Within each block, I extract all strings (key, value, role, user) into a separate string table. Thereafter, strings are referred to by their index into this table. To ensure that frequently used strings have small indexes, I sort the string table by the use frequency of each string. To improve the deflate compressibility of the string table, I then sort strings that have the same frequency lexicographically.

For ways and relations, which contain the IDs of other entities, I exploit the tendency of consecutive nodes in a way or relation to have nearby node IDs by using delta compression, resulting in small integers.

Within each block, I then divide entities into groups that contain consecutive messages all of the same type (node/way/relation), with one interesting case. If there is a batch of consecutive nodes to be output that have no tags at all, I use a special dense format: I omit the tags and store the group 'columnwise', as an array of IDs, an array of latitudes, and an array of longitudes, and delta-encode each column. This reduces header overheads and allows delta coding to work very effectively. With the default ~1cm granularity, nodes within about 6km of each other can be represented by as few as 7 bytes each, plus the cost of the metadata if it is included.

  message DenseNodes {
     repeated sint64 id = 1 [packed = true];  // DELTA coded
     repeated sint64 lat = 7 [packed = true]; // DELTA coded
     repeated sint64 lon = 8 [packed = true]; // DELTA coded
     repeated Info info = 12;
  }

After being serialized into a string, each block is gzip/deflate compressed individually.
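The columnwise delta coding can be sketched as a simple round trip; this is illustrative code, not the actual serializer:

```java
// Sketch of DenseNodes-style delta coding: each value is stored as the
// difference from the previous one, so runs of nearby node IDs or
// coordinates become small integers that varint-encode compactly.
import java.util.Arrays;

public class DeltaDemo {
    static long[] deltaEncode(long[] values) {
        long[] out = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] - prev;
            prev = values[i];
        }
        return out;
    }

    static long[] deltaDecode(long[] deltas) {
        long[] out = new long[deltas.length];
        long acc = 0;
        for (int i = 0; i < deltas.length; i++) {
            acc += deltas[i];
            out[i] = acc;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ids = {650123401L, 650123402L, 650123403L, 650123407L};
        long[] deltas = deltaEncode(ids);
        System.out.println(Arrays.toString(deltas)); // [650123401, 1, 1, 4]
        System.out.println(Arrays.equals(ids, deltaDecode(deltas))); // true
    }
}
```

Only the first value in each column pays the full cost; every subsequent consecutive entity costs one or two bytes per column.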
When deflate compression is not used, the file is somewhat faster to generate and parse, and file sizes increase from 6.2gb to 21gb for the file containing metadata and from 5.2gb to 8.8gb for the file without metadata. This means that metadata accounts for 60% of the uncompressed file size, and that compression reduces the size of the map data by 45% and the metadata by 87%. In osmosis, the number of entities in a block, whether deflate compression is used, and the granularity can all be selected at the command line.

// Common code

Unfortunately, osmosis, mkgmap, and the splitter each use a different representation for OSM entities. This means that I have to reimplement the entity serialization code for each program, and there is less common code between the implementations than I would like. I have only implemented a serializer and deserializer for osmosis and a deserializer for the splitter. Some code is common, for instance the protocol buffer definitions and the code for reading and parsing at the file level.

// Changes to osmosis

The changes to osmosis are just some new tasks to handle reading and writing the binary format. Possible future improvements could include extending the file format to support changelists or using this format for storing data internally.

// Changes to the splitter

As my binary format is about one quarter the size of the cache used by the current splitter design, I ripped out that code. I then refactored the file reader code of the splitter to operate at the level of entities, not XML tags. Finally, I added the binary parser. When splitting the entire planet file, the first pass takes about 10 minutes to read the file and calculate the regions. The remaining passes are much slower because they generate gzipped XML. I have not enhanced the splitter to generate binary output, as mkgmap does not support the binary file format and doing so would require refactoring.
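Per-block deflate compression needs nothing beyond java.util.zip; a minimal sketch of the round trip (illustrative code, not the actual block writer):

```java
// Sketch: compressing one serialized fileblock individually with deflate.
import java.util.Arrays;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class BlockCompressDemo {
    static byte[] compress(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        byte[] buf = new byte[raw.length + 64];
        int len = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, len);
    }

    // The uncompressed size would be recorded in the block header, so the
    // reader can allocate the output buffer up front.
    static byte[] decompress(byte[] compressed, int rawSize) throws Exception {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[rawSize];
        inflater.inflate(out);
        inflater.end();
        return out;
    }

    public static void main(String[] args) throws Exception {
        // Repetitive data, like a serialized block with a string table,
        // compresses well.
        byte[] block = new String(new char[1000]).replace('\0', 'a').getBytes();
        byte[] packed = compress(block);
        System.out.println(packed.length < block.length);                  // true
        System.out.println(Arrays.equals(decompress(packed, block.length),
                                         block));                         // true
    }
}
```

Because each block is compressed on its own, a reader can decompress any block without touching the rest of the file, which is what makes block-level random access possible.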
However, I suspect that if it generated binary output, the splitter could do a full split of the planet in one to two hours. Currently, the splitter requires that osmosis be in the classpath in order to access the common code.

// Possible enhancements

In the design, each block is independently decodable and there is support for attaching metadata (e.g., bounding boxes) so that unneeded blocks can be skipped. If nodes, ways, and relations are placed into blocks with high geographic locality, a bounding box test may be able to skip substantial portions of the file and increase efficiency when extracting relevant subsets. Alternative indexing structures are possible, such as cross-references between blocks. Currently, no such metadata is generated or used.

The extensibility of protocol buffer messages to support optional fields can be used to cheaply store optional extra data, such as bounding boxes, in nodes, ways, and relations. There is no space penalty for having unused fields in files that do not contain the optional data, except that each field requires permanently reserving a tag number (the '= 7'). Small tag numbers in [1,15] are more valuable than larger numbers because they require only one byte of header overhead; tag numbers in [16,2047] require two bytes. I estimate that adding a bounding box to each way would require 20-25 bytes per way for 50M ways, or about a gigabyte when deflate compression is disabled.

// Rejected design choices

There are a few design choices I rejected. I chose not to delta encode node IDs, latitudes, and longitudes for nodes that contain tags. Although this would save around 3-5%, it makes the parser and serializer more complex. I also considered specially encoding tags with integer-valued or float-valued values. They occur fairly often in a planet dump (~15%), but encoding them in binary saves almost nothing (.2%) and is a significant slowdown.
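The proposed bounding-box skipping amounts to a rectangle intersection test per block. Since the format does not generate this metadata yet, everything below is hypothetical illustration:

```java
// Hypothetical sketch of block skipping: a reader compares a query box
// against each block's bounding box and only decodes blocks that overlap.
public class BBoxSkipDemo {
    // A box is {minLat, minLon, maxLat, maxLon}, in nanodegrees to match
    // the format's integer coordinate units.
    static boolean intersects(long[] a, long[] b) {
        return a[0] <= b[2] && b[0] <= a[2]   // latitude ranges overlap
            && a[1] <= b[3] && b[1] <= a[3];  // longitude ranges overlap
    }

    public static void main(String[] args) {
        long[] query        = {41_000_000_000L, -72_000_000_000L,
                               42_000_000_000L, -71_000_000_000L};
        long[] blockInside  = {41_500_000_000L, -71_500_000_000L,
                               41_600_000_000L, -71_400_000_000L};
        long[] blockOutside = {48_000_000_000L,   2_000_000_000L,
                               49_000_000_000L,   3_000_000_000L};
        System.out.println(intersects(query, blockInside));  // true
        System.out.println(intersects(query, blockOutside)); // false
    }
}
```

With geographically local blocks, most of a planet file would fail this test for a small extract and never need to be decompressed at all.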
// TODO's

Probably the most important TODO is packaging and fixing the build system. I have almost no experience with ant and am unfamiliar with Java packaging practices, so I'd like to request help/advice on ant and suggestions on how to package the common parsing/serializing code so that it can be re-used across different programs.

For future-compatibility reasons, I need to consider including file-format flags and version numbers in the file header to indicate when a file includes features that a reader must support (e.g., dense nodes, as described above).

// Questions

Ideas, suggestions, improvements, and other uses for this format? Should I check in the protobuf.jar and generated code? Can anyone suggest how to add appropriate ant rules, or recommend a convenient ant buildfile generator/editor that integrates well with eclipse?

// Command lines:

Some example command lines to generate files, for invocation from the build directory ~/source/osmosis/eclipse:

  java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" org.openstreetmap.osmosis.core.Osmosis --read-xml file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet-all.bin

  java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" org.openstreetmap.osmosis.core.Osmosis --read-xml file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet.bin omitmetadata=true

  # Version with ~1m precision
  java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" org.openstreetmap.osmosis.core.Osmosis --read-xml file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet-for-display.bin omitmetadata=true granularity=10000
_______________________________________________
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev