I got it built. Build instructions: first get protobuf and build it manually:
svn checkout http://protobuf.googlecode.com/svn/trunk/ protobuf-read-only
cd protobuf-read-only/
bash ./autogen.sh
./configure
make
sudo make install

Then go into the java dir and use ant to build the jar files.

Go to the sources of the OSM tool and install the protobuf jar into lib/compile. Normally this would be handled by ivy, but there are issues:

cp /home/mdupont/experiments/osm/protobuf-read-only/java/target/protobuf-java-2.3.1-pre.jar lib/compile/

cd src/crosby/binary/
protoc --java_out=../.. fileformat.proto
protoc --java_out=../.. osmformat.proto

This generates the needed code. Now you can build.

On Wed, Apr 28, 2010 at 7:02 PM, Scott Crosby <[email protected]> wrote:
> Hello!
>
> I would like to announce code implementing a binary OSM format that supports the full semantics of the OSM XML. It is 5x-10x faster at reading and writing and 30-50% smaller; an entire planet, including all metadata, can be read in about 12 minutes and written in about 50 minutes on a 3-year-old dual-core machine. I have implemented an osmosis reader and writer and have enhancements to the map splitter to read the format. The code is pure Java and uses Google protocol buffers for the low-level serialization.
>
> Comparing the file sizes:
>
> 8.2 GB  planet-100303.osm.bz2
> 12  GB  planet-100303.osm.gz
> 5.2 GB  planet-omitmeta.bin
> 6.2 GB  planet.bin
>
> The omitmeta version omits the uid/user/version/timestamp metadata fields on each entity and is faster to generate and read.
>
> The design is very extensible. The low-level file format is designed to support random access at the 'fileblock' granularity, where a fileblock can contain ~8k OSM entities. There is *no* tag hardcoding used; all keys and values are stored in full as opaque strings. For future scalability, 64-bit node/way/relation IDs are assumed.
> The current serializer preserves the order of OSM entities and of tags on OSM entities. To flexibly handle multiple resolutions, the granularity, or resolution used for representing locations and timestamps, is adjustable in multiples of 1 millisecond and 1 nanodegree and can be set independently for each fileblock. The default scaling factor is 1000 milliseconds and 100 nanodegrees, corresponding to about ~1 cm at the equator. These are the current resolutions of the OSM database.
>
> Smaller files can be generated. At 10 microdegrees granularity, corresponding to about 1 m of resolution, the filesize decreases by about 1 GB. Space may also be saved by removing uninteresting UUID tags, or perhaps by arranging for stronger geographic locality when building the file.
>
> I have also tested the binary format on some SRTM contour lines in OSM 0.5 XML format, obtaining about a 50:1 compression ratio. This might be further improved by choosing a granularity equal to the isohypse grid size.
>
> // Testing
>
> I have tested this code on the Cloudmade extract of Rhode Island. After converting the entire file to and from binary format, the XML output is bytewise identical to the original file, except for the one line indicating the osmosis version number.
>
> When run through the splitter, the output is not bytewise identical to before because of round-off errors 16 digits after the decimal point; this could be fixed by having the splitter behave like osmosis and only output 7 significant digits.
>
> // To use:
>
> Demonstration code is available on github at
>
> http://github.com/scrosby/OSM-Osmosis and
> http://github.com/scrosby/OSM-splitter
>
> See the 'master' branches.
>
> Please note that this is at present unpackaged demonstration code, and the fileformat may change to incorporate suggestions. Also note that the shared code between the splitter and osmosis currently lives in the osmosis git repository.
> You'll also need to go into the crosby.binary directory and run the protocol compiler ('protoc') on the .proto files (see the comments in those files for the command line).
>
> /// The design ///
>
> I use Google protocol buffers for the low-level store. Given a specification file of one or more messages, the protocol buffer compiler writes low-level serialization code. Messages may contain other messages, forming hierarchical structures. Protocol buffers also support extensibility; new fields can be added to a message and old clients can read those messages without recompiling. For more details, please see http://code.google.com/p/protobuf/. Google officially supports C++, Java, and Python, but compilers exist for other languages. An example message specification is:
>
> message Node {
>    required sint64 id = 1;
>    required sint64 lat = 7;
>    required sint64 lon = 8;
>    repeated uint32 keys = 9 [packed = true];  // Denote strings
>    repeated uint32 vals = 10 [packed = true]; // Denote strings
>    optional Info info = 11; // Contains metadata
> }
>
> Protocol buffers use a variable-bit encoding for integers. An integer is encoded at 7 bits per byte, where the high bit indicates whether or not the next byte is to be read. When messages contain small integers, the filesize is minimized. Two encodings exist: one intended for mostly positive integers and one for signed integers. In the standard encoding, integers [0,127] require one byte, [128,16383] require two bytes, etc. In the signed number encoding, the sign bit is placed in the least significant position; numbers [-64,63] require one byte, [-8192,8191] require two bytes, and so forth.
>
> To encode OSM entities into protocol buffers, I collect 8k entities to form a 'fileblock'. Within each block, I extract all strings (key, value, role, user) into a separate string table. Thereafter, strings are referred to by their index into this table.
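The two integer encodings described above can be sketched with a few lines of Java. This is a hand-rolled illustration only; in practice the protobuf runtime does this encoding for you.

```java
import java.io.ByteArrayOutputStream;

public class VarintSketch {
    // Standard varint: 7 bits per byte, high bit set while more bytes follow.
    static byte[] encodeVarint(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.write((int) value);
        return out.toByteArray();
    }

    // Signed ('zigzag') encoding: the sign bit moves to the least
    // significant position, so small negative numbers stay small.
    static long zigzag(long value) {
        return (value << 1) ^ (value >> 63);
    }

    public static void main(String[] args) {
        // [0,127] fit in one byte; [128,16383] need two.
        System.out.println(encodeVarint(127).length);          // 1
        System.out.println(encodeVarint(128).length);          // 2
        // With zigzag, [-64,63] fit in one byte.
        System.out.println(encodeVarint(zigzag(-64)).length);  // 1
        System.out.println(encodeVarint(zigzag(-65)).length);  // 2
    }
}
```

This is why the string-table indexes and delta-coded values described in this post stay cheap: anything that is usually a small integer costs about one byte on the wire.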
> To ensure that frequently used strings have small indexes, I sort the string table by the use frequency of each string. To improve the deflate compressibility of the string table, I then sort strings that have the same frequency lexicographically.
>
> For ways and relations, which contain the IDs of other nodes, I exploit the tendency of consecutive nodes in a way or relation to have nearby node IDs by using delta compression, resulting in small integers.
>
> Within each block, I then divide entities into groups that contain consecutive messages all of the same type (node/way/relation), with one interesting case. If there is a batch of consecutive nodes to be output that have no tags at all, I use a special dense format: I omit the tags and store the group 'columnwise', as an array of IDs, an array of latitudes, and an array of longitudes, and delta-encode each column. This reduces header overhead and allows delta-coding to work very effectively. With the default ~1 cm granularity, nodes within about 6 km of each other can be represented by as few as 7 bytes each, plus the cost of the metadata if it is included.
>
> message DenseNodes {
>    repeated sint64 id = 1 [packed = true];  // DELTA coded
>    repeated sint64 lat = 7 [packed = true]; // DELTA coded
>    repeated sint64 lon = 8 [packed = true]; // DELTA coded
>    repeated Info info = 12;
> }
>
> After being serialized into a string, each block is gzip/deflate compressed individually. When deflate compression is not used, the file is somewhat faster to generate and parse, and the filesize increases from 6.2 GB to 21 GB for the file containing metadata and from 5.2 GB to 8.8 GB for the file without metadata. This means that metadata accounts for 60% of the uncompressed file size, and that compression reduces the size of the map data by 45% and the metadata by 87%.
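The columnwise delta coding and the granularity scaling described above can be sketched as follows. The class and method names are mine for illustration, not the actual osmosis/crosby.binary API; only the default granularity of 100 nanodegrees is taken from the post.

```java
public class DenseNodesSketch {
    static final long GRANULARITY = 100; // default: 100 nanodegrees (~1 cm)

    // Degrees -> integer units of GRANULARITY nanodegrees.
    static long toUnits(double degrees) {
        return Math.round(degrees * 1e9 / GRANULARITY);
    }

    // Delta-code a column: each entry becomes the difference from its
    // predecessor, so consecutive nearby values turn into small integers
    // that the varint encoding stores in one or two bytes.
    static long[] deltaEncode(long[] column) {
        long[] out = new long[column.length];
        long last = 0;
        for (int i = 0; i < column.length; i++) {
            out[i] = column[i] - last;
            last = column[i];
        }
        return out;
    }

    public static void main(String[] args) {
        // Three consecutive node IDs and the latitudes of nearby nodes.
        long[] ids = {1000, 1001, 1003};
        long[] lat = {toUnits(41.8240), toUnits(41.8241), toUnits(41.8242)};
        long[] dIds = deltaEncode(ids); // {1000, 1, 2}
        long[] dLat = deltaEncode(lat); // first entry large, then small deltas
        System.out.println(java.util.Arrays.toString(dIds));
        System.out.println(java.util.Arrays.toString(dLat));
    }
}
```

After delta coding, only the first entry in each column is large; everything after it is a small integer, which is what makes the ~7-bytes-per-node figure plausible.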
> In osmosis, I can select the number of entities in a block, whether deflate compression is used, and the granularity at the command line.
>
> // Common code
>
> Unfortunately, osmosis, mkgmap, and the splitter each use a different representation for OSM entities. This means that I have to reimplement the entity serialization code for each program, and there is less common code between the implementations than I would like. I have only implemented a serializer and deserializer for osmosis and a deserializer for the splitter. Some code is common, for instance the protocol buffer definitions and the code for reading and parsing at the file level.
>
> // Changes to osmosis
>
> The changes to osmosis are just some new tasks to handle reading and writing the binary format. Possible future improvements could include extending the fileformat to support changelists, or using this format for storing data internally.
>
> // Changes to the splitter
>
> As my binary format is about one quarter the size of the cache used by the current splitter design, I ripped out that code. I then refactored the file reader code of the splitter to operate at the level of entities, not XML tags. Finally, I added the binary parser. When splitting the entire planet file, the first pass takes about 10 minutes to read the file and calculate the regions. The remaining passes are much slower, as they generate gzipped XML. I have not enhanced the splitter to generate binary output, as mkgmap does not support the binary file format and would require refactoring. However, I suspect that if it generated binary output, the splitter could do a full split of the planet in one to two hours.
>
> Currently, the splitter requires that osmosis be in the classpath in order to access the common code.
> // Possible enhancements
>
> In the design, each block is independently decodable, and there is support for attaching metadata (e.g., bounding boxes) so that unneeded blocks can be skipped. If nodes, ways, and relations are placed into blocks with high geographic locality, a bounding box test may be able to skip substantial portions of the file and increase the efficiency of extracting relevant subsets. Alternative indexing structures are possible, such as cross-references between blocks. Currently, no such metadata is generated or used.
>
> The extensibility of protocol buffer messages can be used to cheaply store optional extra data, such as bounding boxes, in nodes, ways, and relations. There is no space penalty for unused fields in files that do not contain the optional data, except that each field requires permanently reserving a tag number (the '= 7'). Small tag numbers in [1,15] are more valuable than larger numbers because they require only one byte of header overhead; tag numbers in [16,2047] require two bytes.
>
> I estimate that adding a bounding box to each way would require 20-25 bytes per way for 50M ways, or about a gigabyte when deflate compression is disabled.
>
> // Rejected design choices
>
> There are a few design choices I rejected. I chose not to delta-encode node IDs, latitudes, and longitudes for nodes that contain tags. Although this would save around 3-5%, it makes the parser and serializer more complex.
>
> I considered specially encoding tags with integer-valued or float-valued values. They occur fairly often in a planet dump (~15%), but encoding them in binary saves almost nothing (0.2%) and is a significant slowdown.
>
> // TODO's
>
> Probably the most important TODO is packaging and fixing the build system.
> I have almost no experience with ant and am unfamiliar with Java packaging practices, so I'd like to request help/advice on ant and suggestions on how to package the common parsing/serializing code so that it can be re-used across different programs.
>
> For future-compatibility reasons, I need to consider including file-format flags and version numbers in the file header to indicate when a file includes features that a reader must support (e.g., dense nodes, as described above).
>
> // Questions
>
> Ideas, suggestions, improvements, and other uses of this format?
>
> Should I check in the protobuf.jar and generated code?
>
> Can anyone suggest how to add appropriate ant rules, or a convenient ant buildfile generator/editor that integrates well with eclipse?
>
> // Command lines:
>
> Some example command lines to generate files, for invocation from the build directory ~/source/osmosis/eclipse:
>
> java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" org.openstreetmap.osmosis.core.Osmosis --read-xml file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet-all.bin
>
> java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" org.openstreetmap.osmosis.core.Osmosis --read-xml file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet.bin omitmetadata=true
>
> # Version with ~1m precision
> java -cp ".:../lib/default/*:/usr/share/java/protobuf.jar" org.openstreetmap.osmosis.core.Osmosis --read-xml file=/mnt/map/planet-100303.osm.gz --lp --write-bin file=planet-for-display.bin omitmetadata=true granularity=10000
>
> _______________________________________________
> dev mailing list
> [email protected]
> http://lists.openstreetmap.org/listinfo/dev

