Hello!

I would like to announce code implementing a binary OSM format that
supports the full semantics of the OSM XML. It is 5x-10x faster at
reading and writing and 30-50% smaller; an entire planet, including
all metadata, can be read in about 12 minutes and written in about 50
minutes on a 3 year old dual-core machine. I have implemented an
osmosis reader and writer and have enhancements to the map splitter to
read the format. Code is pure Java and uses Google protocol buffers
for the low-level serialization.

Comparing the file sizes:

 8.2gb   planet-100303.osm.bz2
12  gb   planet-100303.osm.gz
 5.2gb   planet-omitmeta.bin
 6.2gb   planet.bin

The omitmeta version omits the uid/user/version/timestamp metadata
fields on each entity and is faster to generate and read.

The design is very extensible. The low-level file format is designed
to support random access at the 'fileblock' granularity, where a
fileblock can contain ~8k OSM entities. There is *no* tag hardcoding
used; all keys and values are stored in full as opaque strings. For
future scalability, 64-bit node/way/relation IDs are assumed. The
current serializer preserves the order of OSM entities and of the
tags on each entity. To flexibly handle multiple resolutions, the
granularity, or resolution, used for representing locations and
timestamps is adjustable in multiples of 1 millisecond and 1
nanodegree and can be set independently for each fileblock. The
defaults are 1000 milliseconds and 100 nanodegrees, the latter
corresponding to about 1cm at the equator; these match the current
resolution of the OSM database.
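
As an illustration, here is a minimal Java sketch (not the shipped
serializer; the class name and example coordinate are mine) of how a
coordinate in degrees maps to the fixed-point units stored in a
fileblock:

public class GranularityDemo {
   public static void main(String[] args) {
      long granularity = 100;       // default: 100 nanodegrees, ~1cm
      double latDegrees = 41.8240;  // arbitrary example latitude
      // Stored value: a multiple of the granularity, in nanodegrees.
      long latUnits = Math.round(latDegrees * 1e9 / granularity);
      // Decoding reverses the scaling.
      double decoded = latUnits * granularity * 1e-9;
      System.out.println(latUnits + " units -> " + decoded + " degrees");
   }
}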

Smaller files can be generated. At a granularity of 10 microdegrees,
corresponding to about 1m of resolution, the filesize decreases by
about 1gb. Space may also be saved by removing uninteresting UUID
tags, or perhaps by arranging for stronger geographic locality when
building the file.

I have also tested the binary format on some SRTM contour lines in
OSM 0.5 XML format, obtaining about a 50:1 compression ratio. This
might be further improved by choosing a granularity equal to the
isohypse grid size.

// Testing

I have tested this code on the Cloudmade extract of Rhode
Island. After converting the entire file to and from binary format,
the XML output is bytewise identical to the original file except for
the one line indicating the osmosis version number.

When run through the splitter, the output is not bytewise identical to
before because of round-off errors 16 digits after the decimal point;
this could be fixed by having the splitter behave like osmosis and
only output 7 significant digits.

// To use:

Demonstration code is available on github at

   http://github.com/scrosby/OSM-Osmosis       and
   http://github.com/scrosby/OSM-splitter

See the 'master' branches.

Please note that this is at present unpackaged demonstration code and
the file format may change to incorporate suggestions. Also note that
the code shared between the splitter and osmosis currently lives in
the osmosis git repository. You'll also need to go into the
crosby.binary directory and run the protocol compiler ('protoc') on
the .proto files (see the comments in those files for the command
line).


/// The design ///

I use Google protocol buffers for the low-level store. Given a
specification file of one or more messages, the protocol buffer
compiler writes low-level serialization code. Messages may contain
other messages, forming hierarchical structures. Protocol buffers also
support extensibility; new fields can be added to a message and old
clients can read those messages without recompiling. For more details,
please see http://code.google.com/p/protobuf/. Google officially
supports C++, Java, and Python, but compilers exist for other
languages.  An example message specification is:

message Node {
   required sint64 id = 1;
   required sint64 lat = 7;
   required sint64 lon = 8;
   repeated uint32 keys = 9 [packed = true];  // Denote strings
   repeated uint32 vals = 10 [packed = true]; // Denote strings
   optional Info info = 11; // Contains metadata
}


Protocol buffers use a variable-length encoding for integers. An
integer is encoded at 7 bits per byte, where the high bit of each
byte indicates whether another byte follows. When messages contain
small integers, the filesize is minimized. Two encodings exist: one
intended for mostly-positive integers and one for signed integers. In
the standard encoding, integers in [0,127] require one byte,
[128,16383] require two bytes, etc. In the signed encoding, the sign
bit is placed in the least significant position; numbers in [-64,63]
require one byte, [-8192,8191] require two bytes, and so forth.
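
To make the byte counts concrete, here is a small Java sketch; the
zigzag transform is the standard protobuf one, but the helper names
are mine:

public class VarintDemo {
   // Zigzag: map signed to unsigned so small magnitudes stay small.
   // 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
   static long zigzag(long n) {
      return (n << 1) ^ (n >> 63);
   }

   // Bytes needed at 7 payload bits per byte.
   static int varintSize(long v) {
      int bytes = 1;
      while ((v >>>= 7) != 0) bytes++;
      return bytes;
   }

   public static void main(String[] args) {
      System.out.println(varintSize(127));          // 1 byte
      System.out.println(varintSize(128));          // 2 bytes
      System.out.println(varintSize(zigzag(63)));   // 1 byte
      System.out.println(varintSize(zigzag(-64)));  // 1 byte
      System.out.println(varintSize(zigzag(-65)));  // 2 bytes
   }
}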

To encode OSM entities into protocol buffers, I collect 8k entities
to form a 'fileblock'. Within each block, I extract all strings (key,
value, role, user) into a separate string table. Thereafter, strings
are referred to by their index into this table. To ensure that
frequently used strings have small indexes, I sort the string table
by each string's use frequency. To improve the deflate
compressibility of the string table, I then sort strings that have
the same frequency lexicographically.
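
An illustrative Java sketch of the string table construction (not the
actual implementation; names are mine):

import java.util.*;

public class StringTableDemo {
   // Given every string occurrence in a block, build the table.
   static List<String> buildTable(List<String> occurrences) {
      Map<String, Integer> counts = new HashMap<>();
      for (String s : occurrences)
         counts.merge(s, 1, Integer::sum);
      List<String> table = new ArrayList<>(counts.keySet());
      table.sort((a, b) -> {
         int byFreq = counts.get(b) - counts.get(a);    // frequent first
         return byFreq != 0 ? byFreq : a.compareTo(b);  // ties: lexicographic
      });
      return table;  // entities then store indexes into this list
   }
}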

For ways and relations, which contain the IDs of other nodes, I
exploit the tendency of consecutive nodes in a way or relation to
have nearby node IDs by using delta compression, resulting in small
integers.
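
The delta coding itself is simple; an illustrative Java sketch:

public class DeltaDemo {
   static long[] deltaEncode(long[] ids) {
      long[] out = new long[ids.length];
      long last = 0;
      for (int i = 0; i < ids.length; i++) {
         out[i] = ids[i] - last;  // small signed value for nearby IDs
         last = ids[i];
      }
      return out;
   }

   static long[] deltaDecode(long[] deltas) {
      long[] out = new long[deltas.length];
      long last = 0;
      for (int i = 0; i < deltas.length; i++) {
         last += deltas[i];  // running sum restores the original IDs
         out[i] = last;
      }
      return out;
   }
}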


Within each block, I then divide entities into groups that contain
consecutive messages all of the same type (node/way/relation), with
one interesting case. If there is a batch of consecutive nodes to be
output that have no tags at all, I use a special dense format. I omit
the tags and store the group 'columnwise', as an array of IDs, an
array of latitudes, and an array of longitudes, and delta-encode each
column. This reduces header overheads and allows delta-coding to work
very effectively. With the default ~1cm granularity, nodes within
about 6 km of each other can be represented by as few as 7 bytes
each, plus the cost of the metadata if it is included.

message DenseNodes {
   repeated sint64 id = 1 [packed = true]; // DELTA coded
   repeated sint64 lat = 7 [packed = true]; // DELTA coded
   repeated sint64 lon = 8 [packed = true]; // DELTA coded
   repeated Info info = 12;
}
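
Encoding into this message amounts to splitting the nodes into three
parallel columns and delta-coding each one independently, roughly as
in this illustrative sketch (names are mine):

public class DenseNodesDemo {
   // Nodes arrive as rows (id, lat, lon); the block stores them as
   // three delta-coded columns.
   static long[][] toColumns(long[][] nodes) {
      int n = nodes.length;
      long[] ids = new long[n], lats = new long[n], lons = new long[n];
      for (int i = 0; i < n; i++) {
         ids[i] = nodes[i][0];
         lats[i] = nodes[i][1];
         lons[i] = nodes[i][2];
      }
      return new long[][] { delta(ids), delta(lats), delta(lons) };
   }

   static long[] delta(long[] vals) {
      long[] out = new long[vals.length];
      long last = 0;
      for (int i = 0; i < vals.length; i++) {
         out[i] = vals[i] - last;
         last = vals[i];
      }
      return out;
   }
}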

After being serialized into a string, each block is gzip/deflate
compressed individually. When deflate compression is not used, the
file is somewhat faster to generate and parse, but the filesize
increases from 6.2gb to 21gb for the file containing metadata and
from 5.2gb to 8.8gb for the file without metadata. This means that
metadata accounts for about 60% of the uncompressed file size, and
that compression reduces the size of the map data by 45% and the
metadata by 87%.
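
Per-block compression can be done with the standard java.util.zip
classes; a minimal sketch (not the shipped writer; names are mine):

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;

public class BlockCompressDemo {
   static byte[] compress(byte[] serializedBlock) {
      Deflater deflater = new Deflater();
      deflater.setInput(serializedBlock);
      deflater.finish();
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      byte[] buf = new byte[4096];
      while (!deflater.finished())
         out.write(buf, 0, deflater.deflate(buf));
      deflater.end();
      return out.toByteArray();  // each block compressed independently
   }
}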

In osmosis, I can select the number of entities in a block, whether
deflate compression is used, and the granularity at the command line.

// Common code

Unfortunately, osmosis, mkgmap, and the splitter each use a different
representation for OSM entities. This means that I have to reimplement
the entity serialization code for each program and there is less
common code between the implementations than I would like. I have only
implemented a serializer and deserializer for osmosis and a
deserializer for the splitter. Some code is common, for instance the
protocol buffer definitions and code for reading and parsing at the
file level.

// Changes to osmosis

The changes to osmosis are just some new tasks to handle reading and
writing the binary format.  Possible future improvements could include
extending the fileformat to support changelists or using this format
for storing data internally.

// Changes to the splitter

As my binary format is about one quarter the size of the cache used by
the current splitter design, I ripped out that code. I then refactored
the file reader code of the splitter to operate at the level of
entities, not XML tags. Finally, I added the binary parser.  When
splitting the entire planet file, the first pass takes about 10
minutes to read the file and calculate the regions. The remaining
passes are much slower as they generate gzipped XML. I have not
enhanced the splitter to generate binary output, as mkgmap does not
support the binary file format and would itself require refactoring.
However, I suspect that if it generated binary output, the splitter
could do a full split of the planet in one to two hours.

Currently, the splitter requires that osmosis be on the classpath in
order to access the common code.


// Possible enhancements

In the design, each block is independently decodable and there is
support for attaching metadata (e.g., bounding boxes) so that
unneeded blocks can be skipped. If nodes, ways, and relations are
placed into blocks with high geographic locality, a bounding box test
may be able to skip substantial portions of the file and increase the
efficiency of extracting relevant subsets. Alternative indexing
structures are possible, such as cross-references between blocks.
Currently, no such metadata is generated or used.

The extensibility of protocol buffer messages via optional fields can
be used to cheaply store optional extra data, such as bounding boxes,
in nodes, ways, and relations. There is no space penalty for unused
fields in files that do not contain the optional data, except that
each field permanently reserves a tag number (the '= 7' above). Small
tag numbers in [1,15] are more valuable than larger numbers because
they require only one byte of header overhead; tag numbers in
[16,2047] require two bytes.
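
The reason is that a field's wire-format key is the varint
(fieldNumber << 3) | wireType; a small Java sketch of the header cost
(method names are mine):

public class TagNumberDemo {
   static int headerBytes(int fieldNumber, int wireType) {
      long key = ((long) fieldNumber << 3) | wireType;
      int bytes = 1;
      while ((key >>>= 7) != 0) bytes++;  // varint size of the key
      return bytes;
   }

   public static void main(String[] args) {
      System.out.println(headerBytes(15, 0));    // 1 byte
      System.out.println(headerBytes(16, 0));    // 2 bytes
      System.out.println(headerBytes(2047, 0));  // still 2 bytes
   }
}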

I estimate that adding a bounding box to each way would require 20-25
bytes per way for 50M ways or about a gigabyte when deflate
compression is disabled.

// Rejected design choices

There are a few design choices I rejected. I chose not to delta
encode node IDs, latitudes, and longitudes for nodes that contain
tags. Although this would save around 3-5%, it makes the parser and
serializer more complex.

I considered specially encoding tags with integer-valued or
float-valued values. They occur fairly often in a planet dump (~15%),
but encoding them in binary saves almost nothing (0.2%) and causes a
significant slowdown.

// TODO's

Probably the most important TODO is packaging and fixing the build
system. I have almost no experience with ant and am unfamiliar with
Java packaging practices, so I'd like to request help/advice on ant
and suggestions on how to package the common parsing/serializing code
so that it can be re-used across different programs.

For future-compatibility reasons, I need to consider including
file-format flags and version numbers in the file header to indicate
when a file uses features that a reader must support (e.g., dense
nodes, as described above).

// Questions

Ideas, suggestions, improvements, and other uses of this format?

Should I check in the protobuf.jar and the generated code?

Can anyone suggest how to add appropriate ant rules, or does anyone
know of a convenient ant buildfile generator/editor that integrates
well with eclipse?

// Command lines:

Some example command lines to generate files, for invocation from
the build directory ~/source/osmosis/eclipse:


java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" \
    org.openstreetmap.osmosis.core.Osmosis --read-xml \
    file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet-all.bin


java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" \
    org.openstreetmap.osmosis.core.Osmosis --read-xml \
    file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet.bin \
    omitmetadata=true

# Version with ~1m precision
java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" \
    org.openstreetmap.osmosis.core.Osmosis --read-xml \
    file=/mnt/map/planet-100303.osm.gz --lp --write-bin \
    file=planet-for-display.bin omitmetadata=true granularity=10000