Re: [OSM-dev] New OSM binary fileformat implementation.
Mike, jamesmikedup...@googlemail.com wrote:

Hi, is there any documentation of the binary format changes? I have implemented a C++ reader using protobuf, and would update it if there is a new format spec.

It would be great if you could check whether your reader still works with the current implementation; then I'd be extremely grateful for some sort of minimal package that contains only your reader and the stuff absolutely necessary to build it. I've checked out your http://github.com/h4ck3rm1k3/OSM-Osmosis but ended up with a tree that contained half (or all?) of Osmosis and lots of autoconf cruft, and it wasn't buildable for me because it expected the Google protobuf stuff to be downloaded and installed separately, and I didn't know what to get or where to install it!

Background: I would like to add binary format support to osm2pgsql, and was hoping to be able to use your code for that.

Scott, apart from a name for the new binary format, it would be great if you'd also recommend a default file extension, since some of our tools try to auto-detect the file format from the name (.osm.bz2, .osm.gz, .osm - maybe .osm.bin for the new stuff?).

Bye
Frederik
-- Frederik Ramm ## eMail frede...@remote.org ## N49°00'09 E008°23'33

___ dev mailing list dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/dev
Re: [OSM-dev] New OSM binary fileformat implementation.
On Wed, Apr 28, 2010 at 12:02 PM, Scott Crosby scrosb...@gmail.com wrote:

Hello! I would like to announce code implementing a binary OSM format that supports the full semantics of the OSM XML. It is 5x-10x faster at reading and writing and 30-50% smaller; an entire planet, including all metadata, can be read in about 12 minutes and written in about 50 minutes on a 3-year-old dual-core machine. I have implemented an osmosis reader and writer and have enhancements to the map splitter to read the format. The code is pure Java and uses Google protocol buffers for the low-level serialization.

Comparing the file sizes:

8.2gb planet-100303.osm.bz2
12gb  planet-100303.osm.gz
5.2gb planet-omitmeta.bin
6.2gb planet.bin

Some newer results: I have a modification to dense nodes to support storing tags, which makes the file for an entire planet about 500mb smaller. Sizes are now:

4.7gb planet-omitmeta.bin
5.7gb planet.bin

Results when dropping the resolution to 1m precision:

3.8gb planet-granularity=1-omitmeta.bin

This reduced-resolution format may be a good choice for distributing OSM snapshots to non-editors.

I have tested the new format on the CloudMade extract of rhode_island, converting 122MB of uncompressed XML to and from the binary format. The result is bytewise identical to the source file except for the osmosis version number at the top.

Scott
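The resolution trade-off above (default ~1 cm granularity vs. a reduced 1 m planet) follows from storing each coordinate as an integer count of granularity units rather than as a double. The sketch below illustrates that arithmetic only; the method names and the nanodegree convention are assumptions for this example, not taken from the actual osmosis/PBF sources, and the real format additionally delta-encodes these integers.

```java
public class GranularityDemo {
    // Encode a coordinate (degrees) as an integer count of granularity units,
    // with granularity given in nanodegrees. Illustrative sketch only.
    static long encode(double degrees, int granularityNanodeg) {
        return Math.round(degrees * 1e9 / granularityNanodeg);
    }

    // Decode back to degrees.
    static double decode(long units, int granularityNanodeg) {
        return units * (double) granularityNanodeg / 1e9;
    }

    public static void main(String[] args) {
        double lat = 49.0025;
        // A granularity of 100 nanodegrees is ~1.1 cm at the equator.
        long fine = encode(lat, 100);     // 490025000
        // 10 microdegrees (granularity 10000) is roughly 1 m resolution.
        long coarse = encode(lat, 10000); // 4900250
        System.out.println(fine + " " + coarse);
        System.out.println(decode(fine, 100) + " " + decode(coarse, 10000));
    }
}
```

Coarser granularity makes the stored integers smaller, which is why the 1 m planet compresses so much better than the 1 cm one.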
Re: [OSM-dev] New OSM binary fileformat implementation.
On Wed, Aug 4, 2010 at 7:17 PM, Brett Henderson br...@bretth.com wrote: [...] If we go down this path I need two things: 1. A versioned jar file containing all re-usable code. Scott, can you take care of this?

I have split off the reusable code into a separate library, distinct from the osmosis-only code, which is currently sitting in my osmosis git repository (published to github). I have created a git repo for the reusable code at http://github.com/scrosby/OSM-binary Note that the history is messy, so I will be rebasing that repository. How do I configure this project to build correctly and produce a versioned jar file?

2. The Osmosis-specific code that I can use in a new osmbin project within the Osmosis Subversion repo. I can probably get them from Git if you let me know which files I need.

I have a working version of the plugin published to my osmosis github mirror. I duct-taped it together by hacking osmosis_plugins.conf, but the binary plugin is working on trunk.

Scott
Re: [OSM-dev] New OSM binary fileformat implementation.
On Sun, Aug 1, 2010 at 5:51 AM, jamesmikedup...@googlemail.com wrote:

Hi, is there any documentation of the binary format changes? I have implemented a C++ reader using protobuf, and would update it if there is a new format spec. mike

No real docs. There are some tweaks around the edges with regard to renaming protocol buffer message names and field names - mostly search/replace. Field numbers may have changed, so the earliest files are not compatible.

There are a few semantic differences that affect parsing. Offset numbers are now in a form that lets the grid in the binary file be aligned with a grid in a dataset, for datasets with a regular grid such as isohypse files. The other notable change is that I have extended DenseNodes to support tags and so have removed the former Node. As a result, '0' is no longer available for use as a string identifier; it is being used as a delimiter.

In a week or two, depending on feedback and any resulting changes, I will upload reference files to github. If you update the reader, I would appreciate a copy.

Thanks, Scott
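Scott's note that '0' is no longer a valid string identifier suggests an encoding where the tags of many consecutive dense nodes are flattened into a single array of string-table indices, with index 0 delimiting the boundary between nodes. The following decoder is a hedged sketch of that idea; the class, method, and array names are invented for illustration and are not the actual .proto definitions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DenseTagsDemo {
    // Decode a flattened keys/vals stream: pairs of string-table indices per
    // node, with 0 (reserved, never a real string id) ending each node's tags.
    static List<Map<String, String>> decode(int[] keysVals, String[] stringTable) {
        List<Map<String, String>> perNode = new ArrayList<>();
        Map<String, String> tags = new LinkedHashMap<>();
        for (int i = 0; i < keysVals.length; ) {
            if (keysVals[i] == 0) {      // delimiter: this node's tags are done
                perNode.add(tags);
                tags = new LinkedHashMap<>();
                i++;
            } else {                      // key index followed by value index
                tags.put(stringTable[keysVals[i]], stringTable[keysVals[i + 1]]);
                i += 2;
            }
        }
        return perNode;
    }

    public static void main(String[] args) {
        String[] strings = { "", "highway", "residential", "name", "Main St" };
        // node 0: two tags; node 1: no tags; node 2: one tag
        int[] keysVals = { 1, 2, 3, 4, 0, 0, 1, 2, 0 };
        System.out.println(decode(keysVals, strings));
    }
}
```

The payoff is that untagged nodes cost a single byte in this stream, which fits the dense-node goal of the format.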
Re: [OSM-dev] New OSM binary fileformat implementation.
On Tue, Aug 3, 2010 at 11:37 PM, Scott Crosby scro...@cs.rice.edu wrote: [...]

Is any code reuse between Osmosis and other applications required?

Yes. The *.proto files must be shared with other applications that use the binary format, including C/Java/Python/.net. I wrote some java parser code, in crosby/binary/file and crosby/binary/*.java, that is intended to be shared across the other Java osmosis applications (e.g., I'm using it in my splitter changes). I suggest that all of this be put in a separate library along with jamesmikedupont's C/C++ code.

Currently Osmosis is split into a number of sub-projects. For example, there's xml, apidb, pgsql, etc. This would be a new project, something like osmbin, although that's a fairly generic name. But presumably we'd only be putting the Osmosis-specific stuff in there. The osmbin project would need to have a dependency on an external lib that contains your re-usable code. That is the tricky bit. Osmosis currently retrieves external dependencies from the public maven repository at repo1.maven.org. The few libraries that aren't available there are checked directly into the Osmosis-managed repository stored in the build-support/repo directory.
The simplest way to solve this is for you to create your third-party library through whatever means you wish; we then check the resultant (properly versioned) jar file into the Osmosis build-support/repo Ivy repository. The osmbin project can then pull that lib in as a dependency and do a build.

If we go down this path I need two things:

1. A versioned jar file containing all re-usable code. Scott, can you take care of this?
2. The Osmosis-specific code that I can use in a new osmbin project within the Osmosis Subversion repo. I can probably get them from Git if you let me know which files I need.

Brett
Re: [OSM-dev] New OSM binary fileformat implementation.
On Sat, Jul 31, 2010 at 11:26 AM, Frederik Ramm frede...@remote.org wrote: [...] This was 3 months ago. What's the status of this project? Are people actively using it? Is it still being developed? Can the Osmosis tasks be used in the new Osmosis code architecture (see over on osmosis-dev) that Brett has introduced with 0.36?

I'm using it personally. I know of no other users, except that Nolan Darilek is interested in whether the format can be expanded with geographic indexing information.

I have a few minor tweaks that I've been intending to make before declaring the format final - basically, defining some optional fileformat fields (e.g., is the file sorted, and on what parameter?). There's no infrastructure using these fields, however.

How much interest is there in this code and format?

Scott
Re: [OSM-dev] New OSM binary fileformat implementation.
On Sun, Aug 1, 2010 at 6:39 AM, Brett Henderson br...@bretth.com wrote: I'll help incorporate this into the rest of Osmosis. There are a few things to work through though.

I don't have a lot of time to work with this, but I can split up my working branch (which includes several unrelated changes) into separate orthogonal pieces. Git is *VERY* good at this. That would simplify integration.

Is there a demand for the binary format in its current incarnation? I'm not keen to incorporate it if nobody will use it.

I think it would be used in the mkgmap splitter, if available.

Can the code be managed in the main OSM Subversion repo instead of Git?

Yes. I use git personally, but there's very good SVN integration.

Is any code reuse between Osmosis and other applications required?

Yes. The *.proto files must be shared with other applications that use the binary format, including C/Java/Python/.net. I wrote some java parser code, in crosby/binary/file and crosby/binary/*.java, that is intended to be shared across the other Java osmosis applications (e.g., I'm using it in my splitter changes). I suggest that all of this be put in a separate library along with jamesmikedupont's C/C++ code.

Scott
Re: [OSM-dev] New OSM binary fileformat implementation.
On Sun, Aug 1, 2010 at 2:35 AM, Brett Henderson br...@bretth.com wrote: [...] I'm curious about this as well. The main reason for me introducing the new project structure was to facilitate the integration of new features like this. They're relatively easy to add (some Ant and Ivy foo required ...), [...] The code hasn't changed a lot, but the build processes have.

Well, that's one of the things Scott said he had no clue how to do. From Scott's mail:

Scott Crosby: // TODO's Probably the most important TODO is packaging and fixing the build system. I have almost no experience with ant and am unfamiliar with java packaging practices, so I'd like to request help/advice on ant and suggestions on how to package the common parsing/serializing code so that it can be re-used across different programs.
Re: [OSM-dev] New OSM binary fileformat implementation.
Hi, is there any documentation of the binary format changes? I have implemented a C++ reader using protobuf, and would update it if there is a new format spec.

mike
Re: [OSM-dev] New OSM binary fileformat implementation.
On Sun, Aug 1, 2010 at 7:34 PM, Erik Johansson erjo...@gmail.com wrote: [...]

Well, that's one of the things Scott said he had no clue how to do. From Scott's mail:

Scott Crosby: // TODO's Probably the most important TODO is packaging and fixing the build system. I have almost no experience with ant and am unfamiliar with java packaging practices, so I'd like to request help/advice on ant and suggestions on how to package the common parsing/serializing code so that it can be re-used across different programs.

I'll help incorporate this into the rest of Osmosis. There are a few things to work through though.

- Is there a demand for the binary format in its current incarnation? I'm not keen to incorporate it if nobody will use it.
- Can the code be managed in the main OSM Subversion repo instead of Git?
- Is any code reuse between Osmosis and other applications required? If only the Osmosis tasks will be managed in the Osmosis project and a component with common functionality managed elsewhere, then I need to know how the common component will be managed and published for consumption in Osmosis.

Brett
Re: [OSM-dev] New OSM binary fileformat implementation.
What about some metrics (performance, size)? The data is the same whether binary or not, so binary really has to pay off significantly.

On 01.08.10 13:39, Brett Henderson wrote: [...]

I'll help incorporate this into the rest of Osmosis. There are a few things to work through though.

* Is there a demand for the binary format in its current incarnation? I'm not keen to incorporate it if nobody will use it.
* Can the code be managed in the main OSM Subversion repo instead of Git?
* Is any code reuse between Osmosis and other applications required? If only the Osmosis tasks will be managed in the Osmosis project and a component with common functionality managed elsewhere, then I need to know how the common component will be managed and published for consumption in Osmosis.

Brett
Re: [OSM-dev] New OSM binary fileformat implementation.
On Sun, Aug 1, 2010 at 17:33, Andreas Kalsch andreaskal...@gmx.de wrote: What about some metrics (performance, size)? Data is the same, whether binary or not. So binary really has to pay off significantly.

What performance metrics would you like that haven't already been covered earlier in this thread, and in the initial announcement?
Re: [OSM-dev] New OSM binary fileformat implementation.
Hi,

Brett Henderson wrote: I'll help incorporate this into the rest of Osmosis. There are a few things to work through though. * Is there a demand for the binary format in its current incarnation? I'm not keen to incorporate it if nobody will use it.

I run a nightly job at Geofabrik which currently operates on plain (uncompressed) OSM files and goes roughly like this (every step uses Osmosis):

* apply daily diff to planet file
* split planet file into continents
* split each continent into countries
* split some countries into smaller units
* split some smaller units into even smaller units
* bzip2 the lot

The whole job runs from about 22:00 at night to about 09:00 in the morning, even though I'm ignoring the US. A lot of time is spent just reading from, and writing to, disk and parsing XML. Running the whole thing with .gz files doesn't make a big difference - it saves some disk i/o, adds some CPU time, and doesn't change the XML parsing overhead.

I wanted to test-drive the binary format as a replacement for raw .osm files in this setup, hoping that it would give me the i/o benefits of gzip-compressed data but also slash the XML parsing time. The numbers that have been posted seemed promising. I might even be able to skip the bzip2 step at the end if the binary format should become widely used, just placing binary files on the server, and use the saved time to re-introduce US extracts.

So here's one user who's definitely in for it - the reason I asked right now was that I was planning to have a go at it in the near future, and wanted to make sure that I'm not using an old version or going down a path that everyone else has already discarded. If there's proper integration with Osmosis around the corner then I'd wait for that. The way I understood it, Scott was re-using some code he placed inside the Osmosis tree from within his splitter code.
Also, I could imagine that using this fancy Google library means you'll have some format description files which might be shared across all projects using that library, perhaps even including the C++ reader that jamesmikedupont has built, but I'm not sure.

I prefer SVN over git for the simple reason that I only have to svn up and everything is there - but I'm sure it is going to be a matter of minutes before someone from Iceland points out that the same convenience can be had with git if one knows what one is doing ;)

Bye
Frederik
-- Frederik Ramm ## eMail frede...@remote.org ## N49°00'09 E008°23'33
Re: [OSM-dev] New OSM binary fileformat implementation.
On Sun, 1 Aug 2010, Frederik Ramm wrote: A lot of time is spent just reading from, and writing to, disk and parsing XML. Running the whole thing with .gz files doesn't make a big difference - saves some disk i/o, adds some CPU time, doesn't change XML parsing overhead.

I'm sorry, but the parsing overhead of Java or libXML is basically a known slowness factor. MSXML, pre/post plane parsing or even custom readers are not slow, and are only limited by the disk. So the binary format, per se, is only faster because:

- smaller filesize = less io
- encoding: no xml rewriting

Anything else is currently available using, for example, osmsucker.c - obviously not using an XML parser, because all input is structured. If the binary format can pack our doubles (lat/lon) and integers (version/ids) and make strings available in UTF-8, that skips CPU and IO overhead. But it makes the data not human-readable. I can totally live with that, and I hope the API protocol also gets protocol buffers.

Stefan
Re: [OSM-dev] New OSM binary fileformat implementation.
Hi,

Stefan de Konink wrote: I'm sorry, but the parsing overhead of Java or libXML is basically a known slowness factor.

You don't have to be sorry - you're talking to the person who has patched osm2pgsql to parse XML with strcmp: http://trac.openstreetmap.org/browser/applications/utils/export/osm2pgsql/primitive_xml_parsing

Bye
Frederik
-- Frederik Ramm ## eMail frede...@remote.org ## N49°00'09 E008°23'33
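The primitive-parsing idea Frederik mentions - matching attribute names directly instead of running a full XML parser - can be illustrated with a toy scanner. This is a hedged sketch in Java, not the actual C code behind osm2pgsql's primitive_xml_parsing, and it only works because planet files have a rigid, machine-generated layout.

```java
public class PrimitiveXmlDemo {
    // Pull an attribute value out of a machine-generated <node .../> line by
    // plain string matching. No XML parser, no entity handling: this trades
    // generality for speed, which is the point Frederik is making.
    static String attr(String line, String name) {
        String needle = name + "=\"";
        int start = line.indexOf(needle);
        if (start < 0) return null;          // attribute absent
        start += needle.length();
        int end = line.indexOf('"', start);  // closing quote of the value
        return line.substring(start, end);
    }

    public static void main(String[] args) {
        String line = "<node id=\"123\" lat=\"49.0025\" lon=\"8.3925\"/>";
        System.out.println(attr(line, "lat")); // prints 49.0025
        System.out.println(attr(line, "id"));  // prints 123
    }
}
```

A binary format removes even this scanning step, since fields arrive already typed and delimited.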
Re: [OSM-dev] New OSM binary fileformat implementation.
On Sun, Aug 1, 2010 at 6:00 PM, Stefan de Konink ste...@konink.de wrote: If the binary format can pack our doubles (lat/lon)

lat/lon is stored as a double? I always use an int (and divide/multiply by 1000).

http://wiki.openstreetmap.org/wiki/Database_schema

Yeah, OSM seems to be doing the same thing.
Re: [OSM-dev] New OSM binary fileformat implementation.
Hi,

Anthony wrote: lat/lon is stored as a double? I always use an int (and divide/multiply by 1000).

The binary format seems to encode them as 64-bit integers, but protocol buffers make sure that bits are not wasted if unused. Also, in his original post about the binary format, Scott explained:

If there is a batch of consecutive nodes to be output that have no tags at all, I use a special dense format. I omit the tags and store the group 'columnwise', as an array of IDs, an array of latitudes, and an array of longitudes, and delta-encode each column. This reduces header overheads and allows delta-coding to work very effectively. With the default ~1cm granularity, nodes within about 6 km of each other can be represented by as few as 7 bytes each, plus the cost of the metadata if it is included.

I hope that answers that.

Bye
Frederik
-- Frederik Ramm ## eMail frede...@remote.org ## N49°00'09 E008°23'33
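The columnwise delta coding Scott describes can be sketched in a few lines. This illustrates the technique only, not the actual encoder; in the real format the deltas are then stored as protobuf variable-length integers, which is why small deltas (consecutive IDs, nearby coordinates) cost only a byte or two each.

```java
import java.util.Arrays;

public class DeltaDemo {
    // Replace each value in a column with its difference from the previous
    // value (starting from 0), as done per column for dense nodes.
    static long[] deltaEncode(long[] values) {
        long[] out = new long[values.length];
        long prev = 0;
        for (int i = 0; i < values.length; i++) {
            out[i] = values[i] - prev;
            prev = values[i];
        }
        return out;
    }

    // Invert: running sum reconstructs the original column exactly.
    static long[] deltaDecode(long[] deltas) {
        long[] out = new long[deltas.length];
        long acc = 0;
        for (int i = 0; i < deltas.length; i++) {
            acc += deltas[i];
            out[i] = acc;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ids = { 1000001L, 1000002L, 1000005L };
        long[] enc = deltaEncode(ids); // { 1000001, 1, 3 }
        System.out.println(Arrays.toString(enc));
        System.out.println(Arrays.toString(deltaDecode(enc)));
    }
}
```

Sorted ID columns and geographically clustered coordinate columns both turn into streams of tiny deltas, which is where the quoted "as few as 7 bytes per node" comes from.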
Re: [OSM-dev] New OSM binary fileformat implementation.
On Sun, 1 Aug 2010, Anthony wrote: lat/lon is stored as a double? I always use an int (and divide/multiply by 1000). http://wiki.openstreetmap.org/wiki/Database_schema Yeah, OSM seems to be doing the same thing.

OSM uses too many digits for 16bit numerics anyway ;) But I do hope that you don't expect your geometry engine to store this in an int ;)

Stefan
Re: [OSM-dev] New OSM binary fileformat implementation.
On Sun, Aug 1, 2010 at 21:24, Frederik Ramm frede...@remote.org wrote: I prefer SVN over git for the simple reason that I only have to svn up and everything is there but I'm sure it is going to be a matter of minutes before someone from Iceland points out that the same convenience can be had with git if one knows what they're doing ;)

Why would you think that? I live in Germany now :)
Re: [OSM-dev] New OSM binary fileformat implementation.
Scott, others,

Scott Crosby wrote: I would like to announce code implementing a binary OSM format that supports the full semantics of the OSM XML. [...] The changes to osmosis are just some new tasks to handle reading and writing the binary format. [...]

This was 3 months ago. What's the status of this project? Are people actively using it? Is it still being developed? Can the Osmosis tasks be used in the new Osmosis code architecture (see over on osmosis-dev) that Brett has introduced with 0.36?

Bye
Frederik
-- Frederik Ramm ## eMail frede...@remote.org ## N49°00'09 E008°23'33
Re: [OSM-dev] New OSM binary fileformat implementation.
On Sun, Aug 1, 2010 at 2:26 AM, Frederik Ramm frede...@remote.org wrote: [...] This was 3 months ago. What's the status of this project? Are people actively using it? Is it still being developed? Can the Osmosis tasks be used in the new Osmosis code architecture (see over on osmosis-dev) that Brett has introduced with 0.36?

I'm curious about this as well. The main reason for me introducing the new project structure was to facilitate the integration of new features like this. They're relatively easy to add (some Ant and Ivy foo required ...), and can be removed later on if they're not maintained or people lose interest in them. If there's a demand for this binary format I'm happy to help integrate it as a new project into the existing codebase.

I believe the existing version of this binary OSM format is implemented as a fork in a Git repo, so I suspect it will take some effort to update it to run against 0.36. The code hasn't changed a lot, but the build processes have.

Brett
Re: [OSM-dev] New OSM binary fileformat implementation.
Hey, just wondering what the status on this project is? My app, which could greatly benefit from this format, is advancing to a usable state, but the main bottleneck for me right now is database size. My current format stores Texas in around 10G, which doesn't easily scale. If I can fit the entire globe into just over half that, well, that'd rock. :) Thanks.

On 05/02/2010 04:25 AM, jamesmikedup...@googlemail.com wrote:

Ok, my reader is now working; it can read to the end of the file, and now I am fleshing out the template dump functions to emit the data. g...@github.com:h4ck3rm1k3/OSM-Osmosis.git My new idea is that we could use a binary version of the rtree; I have already ported the rtree to my older template classes. We could use the rtree to sort the data and emit the blocks based on that. The rtree data structures themselves could be stored in the protobuffer so that they are persistent and also readable by all. https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg notes: http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042 doxygen here: http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html

On Sun, May 2, 2010 at 7:35 AM, Scott Crosby scrosb...@gmail.com wrote:

-- Forwarded message -- (Accidentally did not reply to list)

Some of these questions may be a bit premature, but I don't know how far along your design is, and perhaps asking them now may influence that design in ways that work for me.

I'm willing to call what I've designed so far in the file format mostly complete, except for some of the header design issues I've brought up already. The question is what extensions make sense to define now, such as bounding boxes, and choosing the right definition for them.

Unfortunately, this method introduces a variety of complications. First, the database for TX alone is 10 gigs. Ballpark estimations are that I might need half a TB or more to store the entire planet.
I'll also need substantial RAM to store the working set for the DB index. All this means that, to launch this project on a global scale, I'd need a lot more funding than I as an individual am likely to find.

With pruning out metadata, some judicious filtering of uninteresting tags, and increasing the granularity to 10 microdegrees (about 1m resolution), I've fit the whole planet in 3.7gb.

Is there a performance or size penalty to ordering the data geographically rather than by ID?

I expect no performance penalty. As for a size penalty, it will be a mixed bag. Ordering geographically should reduce the similarity of node ID numbers, increasing the space required to store them. It should increase the similarity of latitude and longitude numbers, which would reduce the size. It might change the re-use frequency of strings. On the whole, I suspect the filesize would remain within 10% of what it is now and believe it will decrease, but I have no way to know.

I understand that this won't be the default case, but I'm wondering if there would likely be any major performance issues for using it in situations where you're likely to want bounding-box access rather than simply pulling out entities by ID.

I have no code for pulling entities out by ID, but that would be straightforward to add if there was a demand for it. There should be no problems at all for doing geographic queries. My vision for bounding-box access is that the file lets you skip 'most' blocks that are irrelevant to a query; 'most' depends a lot on the data and how exactly the dataset is sorted for geographic locality.

But there may be problems in geographic queries. Things like cross-continental airways, if they are in the OSM planet file, would cause huge problems; their bounding box would cover the whole continent, intersecting virtually any geographic lookup. Those geographic lookups would then need to find the nodes in those long ways, which would require loading virtually every block containing nodes.
I have considered solutions for this issue, but I do not know if problematic ways like this exist. Does OSM have ways like this?

Also, is there any reason that this format wouldn't be suitable for a site with many active users performing geographic, read-only queries of the data?

A lot of that depends on the query locality. Each block has to be independently decompressed and parsed before the contents can be examined; that takes around 1ms. At a small penalty in filesize, you can use 4k entities in a block, which decompress and parse faster. If the client is interested in many ways in a particular geographic locality, as yours seems to be, then this is perfect. Grab the blocks and cache the decompressed data in RAM, where it can be re-used for subsequent geographic queries in the same locality.

Again, I'd guess not, since the data isn't compressed as such, but maybe seeking several gigs into a
Re: [OSM-dev] New OSM binary fileformat implementation.
Sorry for the delay in responding; crazy life, and I've been fixing existing bugs in my project rather than thinking about breaking new ground.

On 05/02/2010 12:35 AM, Scott Crosby wrote: With pruning out metadata, some judicious filtering of uninteresting tags, and increasing the granularity to 10 microdegrees (about 1m resolution), I've fit the whole planet in 3.7gb.

Sweet. I hope this format works for my use case.

I have no code for pulling entities out by ID, but that would be straightforward to add, if there was a demand for it.

I would definitely need that. I'm coding to the travelingsalesman API's DataSet interface, which does include retrieval by ID.

have to pay a disk seek whether it is in my format or not. My format, being very dense, might let RAM hold the working set and avoid the disk seek. 1ms to decompress is already far faster than a hard drive, though not an SSD.

Keeping everything in RAM is probably workable. At the very least, to go global with a format like this would seem to be a matter of starting with a mid-level VPS that stores everything on disk and eventually upgrading to a high-RAM, low-disk-space EC2 or GoGrid instance. Without it, I'm looking at half a TB of storage and possibly a significant chunk of RAM, and even then I don't think my current dataset can handle that. In other words, I like the option of keeping everything in RAM far better than what I'm doing right now. :)

Could you tell me more about the kinds of lookups your application will do?

Sure.
You can see the interface I've implemented here: http://travelingsales.svn.sourceforge.net/viewvc/travelingsales/trunk/libosm/src/org/openstreetmap/osm/data/IDataSet.java?view=markup Basically, the executive summary is that there are four broad kinds of lookups:
- Entity by ID, as mentioned earlier
- Entities based on intersection with bounding box, currently done by the somewhat inaccurate method of finding all contained nodes, then returning any associated ways/relations. Would be great if I could locate contained ways even if they don't have a node in the box, but even if not, it'd be no worse than what's there now. :)
- Entities by presence of certain tags, in some instances also with bounding box conditions (i.e. all amenity-fuel nodes, or all such nodes within a given bounds)
- Nearest entity to a given point, expanding outward. I can, for instance, roughly find the nearest way by finding the node nearest to a set of coordinates, checking for its presence in any ways, then finding the next nearest and recursing outward until the conditions are met. The conditions check is done externally, so the search need only return the nearest entity, next nearest, etc.
I know you've said elsewhere that you don't want this format to replace the need for a database, and I respect that. I just don't quite know where that line is. Even so, I clearly don't need all of my database's functionality for the OSM-facing aspects of this app and hope that these limited uses are in scope. Thanks for thinking about and working on these issues. :) ___ dev mailing list dev@openstreetmap.org http://lists.openstreetmap.org/listinfo/dev
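The bounding-box and nearest-entity lookups above pair naturally with the block-caching idea discussed elsewhere in the thread: decompress a block once, keep it in RAM, and reuse it for subsequent queries in the same locality. A minimal sketch of such a cache in Java, assuming blocks are keyed by their file offset; the class name, key type, and eviction policy are illustrative assumptions, not part of any actual implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache for decompressed fileblocks, keyed by file offset.
public class BlockCache extends LinkedHashMap<Long, byte[]> {
    private final int maxBlocks;

    public BlockCache(int maxBlocks) {
        // accessOrder = true makes iteration order least-recently-used first,
        // which is what gives us LRU eviction below.
        super(16, 0.75f, true);
        this.maxBlocks = maxBlocks;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<Long, byte[]> eldest) {
        // Evict the least-recently-used decompressed block once over capacity.
        return size() > maxBlocks;
    }
}
```

With query locality, repeated geographic lookups hit the cache and skip the ~1ms decompress-and-parse cost per block.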
Re: [OSM-dev] New OSM binary fileformat implementation.
I have now reworked the reader code to use only the internal google protobuffer features for reading. It now can read the entire file to the end. Code is committed. Also, I have checked in an example small protobuffer file for testing. http://github.com/h4ck3rm1k3/OSM-Osmosis Any testers would be appreciated. In the dir OSM-Osmosis/src/crosby/binary/cpp, build and then run: ./osmprotoread albania.osm.protobuf3 out.txt mike On Sun, May 2, 2010 at 11:25 AM, jamesmikedup...@googlemail.com jamesmikedup...@googlemail.com wrote: Ok, my reader is now working, it can read to the end of the file, now am fleshing out the template dump functions to emit the data. g...@github.com:h4ck3rm1k3/OSM-Osmosis.git My new idea is that we could use a binary version of the rtree, I have already ported the rtree to my older template classes. We could use the rtree to sort the data and emit the blocks based on that. The rtree data structures themselves could be stored in the protobuffer so that it is persistent and also readable by all. https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg notes: http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042 doxygen here: http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html On Sun, May 2, 2010 at 7:35 AM, Scott Crosby scrosb...@gmail.com wrote: -- Forwarded message -- (Accidentally did not reply to list) Some of these questions may be a bit premature, but I don't know how far along your design is, and perhaps asking them now may influence that design in ways that work for me. I'm willing to call what I've designed so far in the file format mostly complete, except for some of the header design issues I've brought up already. The question is what extensions make sense to define now, such as bounding boxes, and choosing the right definition for them. Unfortunately, this method introduces a variety of complications. First, the database for TX alone is 10 gigs. 
Ballpark estimations are that I might need half a TB or more to store the entire planet. I'll also need substantial RAM to store the working set for the DB index. All this means that, to launch this project on a global scale, I'd need a lot more funding than I as an individual am likely to find. With pruning out metadata, some judicious filtering of uninteresting tags, and increasing the granularity to 10 microdegrees (about 1m resolution), I've fit the whole planet in 3.7gb. Is there a performance or size penalty to ordering the data geographically rather than by ID? I expect no performance penalty. As for a size penalty, it will be a mixed bag. Ordering geographically should reduce the similarity for node ID numbers, increasing the space required to store them. It should increase the similarity for latitude and longitude numbers, which would reduce the size. It might change the re-use frequency of strings. On the whole, I suspect the filesize would remain within 10% of what it is now and believe it will decrease, but I have no way to know. I understand that this won't be the default case, but I'm wondering if there would likely be any major performance issues for using it in situations where you're likely to want bounding-box access rather than simply pulling out entities by ID. I have no code for pulling entities out by ID, but that would be straightforward to add, if there was a demand for it. There should be no problems at all for doing geographic queries. My vision for a bounding box access is that the file lets you skip 'most' blocks that are irrelevant to a query. 'most' depends a lot on the data and how exactly the dataset is sorted for geographic locality. But there may be problems in geographic queries. Things like cross-continental airways if they are in the OSM planet file would cause huge problems; their bounding box would cover the whole continent, intersecting virtually any geographic lookup. 
Those geographic lookups would then need to find the nodes in those long ways which would require loading virtually every block containing nodes. I have considered solutions for this issue, but I do not know if problematic ways like this exist. Does OSM have ways like this? Also, is there any reason that this format wouldn't be suitable for a site with many active users performing geographic, read-only queries of the data? A lot of that depends on the query locality. Each block has to be independently decompressed and parsed before the contents can be examined, that takes around 1ms. At a small penalty in filesize, you can use 4k entities in a block which decompress and parse faster. If the client is interested in many ways in a particular geographic locality, as yours seems to, then this is perfect. Grab the blocks and cache the
Re: [OSM-dev] New OSM binary fileformat implementation.
Ok, my reader is now working, it can read to the end of the file, now am fleshing out the template dump functions to emit the data. g...@github.com:h4ck3rm1k3/OSM-Osmosis.git My new idea is that we could use a binary version of the rtree, I have already ported the rtree to my older template classes. We could use the rtree to sort the data and emit the blocks based on that. The rtree data structures themselves could be stored in the protobuffer so that it is persistent and also readable by all. https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg notes: http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042 doxygen here: http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html On Sun, May 2, 2010 at 7:35 AM, Scott Crosby scrosb...@gmail.com wrote: -- Forwarded message -- (Accidentally did not reply to list) Some of these questions may be a bit premature, but I don't know how far along your design is, and perhaps asking them now may influence that design in ways that work for me. I'm willing to call what I've designed so far in the file format mostly complete, except for some of the header design issues I've brought up already. The question is what extensions make sense to define now, such as bounding boxes, and choosing the right definition for them. Unfortunately, this method introduces a variety of complications. First, the database for TX alone is 10 gigs. Ballpark estimations are that I might need half a TB or more to store the entire planet. I'll also need substantial RAM to store the working set for the DB index. All this means that, to launch this project on a global scale, I'd need a lot more funding than I as an individual am likely to find. With pruning out metadata, some judicious filtering of uninteresting tags, and increasing the granularity to 10 microdegrees (about 1m resolution), I've fit the whole planet in 3.7gb. Is there a performance or size penalty to ordering the data geographically rather than by ID? 
I expect no performance penalty. As for a size penalty, it will be a mixed bag. Ordering geographically should reduce the similarity for node ID numbers, increasing the space required to store them. It should increase the similarity for latitude and longitude numbers, which would reduce the size. It might change the re-use frequency of strings. On the whole, I suspect the filesize would remain within 10% of what it is now and believe it will decrease, but I have no way to know. I understand that this won't be the default case, but I'm wondering if there would likely be any major performance issues for using it in situations where you're likely to want bounding-box access rather than simply pulling out entities by ID. I have no code for pulling entities out by ID, but that would be straightforward to add, if there was a demand for it. There should be no problems at all for doing geographic queries. My vision for a bounding box access is that the file lets you skip 'most' blocks that are irrelevant to a query. 'most' depends a lot on the data and how exactly the dataset is sorted for geographic locality. But there may be problems in geographic queries. Things like cross-continental airways if they are in the OSM planet file would cause huge problems; their bounding box would cover the whole continent, intersecting virtually any geographic lookup. Those geographic lookups would then need to find the nodes in those long ways which would require loading virtually every block containing nodes. I have considered solutions for this issue, but I do not know if problematic ways like this exist. Does OSM have ways like this? Also, is there any reason that this format wouldn't be suitable for a site with many active users performing geographic, read-only queries of the data? A lot of that depends on the query locality. Each block has to be independently decompressed and parsed before the contents can be examined, that takes around 1ms. 
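The "similarity" of ID numbers matters because consecutive entities are delta-coded: when entities are written in ID order, successive IDs differ by small amounts, and small deltas serialize to few bytes as protobuf varints. A minimal sketch of the delta-coding idea only, assuming a plain long[] representation; the real format's varint and per-field encoding layers are omitted:

```java
import java.util.Arrays;

// Plain delta coding: small output values when input IDs are sorted/dense.
public class DeltaCode {
    public static long[] encode(long[] ids) {
        long[] out = new long[ids.length];
        long prev = 0;
        for (int i = 0; i < ids.length; i++) {
            out[i] = ids[i] - prev;   // delta against the previous ID
            prev = ids[i];
        }
        return out;
    }

    public static long[] decode(long[] deltas) {
        long[] out = new long[deltas.length];
        long prev = 0;
        for (int i = 0; i < deltas.length; i++) {
            prev += deltas[i];        // running sum reconstructs the IDs
            out[i] = prev;
        }
        return out;
    }
}
```

Reordering geographically scrambles ID order, so the deltas grow, which is exactly the size penalty Scott anticipates for IDs (while lat/lon deltas shrink).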
At a small penalty in filesize, you can use 4k entities in a block which decompress and parse faster. If the client is interested in many ways in a particular geographic locality, as yours seems to, then this is perfect. Grab the blocks and cache the decompressed data in RAM where it can be re-used for subsequent geographic queries in the same locality. Again, I'd guess not, since the data isn't compressed as such, but maybe seeking several gigs into a file to locate nearby entities would be a factor, or it may work just fine for single-user access but not so well with multiple distinct seeks for different users in widely separate locations. Ultimately, it depends on your application, which has a particular locality in its lookups. Application locality, combined with a fileformat, defines the working set size. If RAM is insufficient to hold the working set, you'll have to pay a disk seek whether it is in my format or not. My
Re: [OSM-dev] New OSM binary fileformat implementation.
Initial thoughts from my end (as main developer of Osmosis) is that it would be best to keep it separate until you've let it evolve somewhat. If you can develop it as a plugin for now it will let us (ie. Osmosis core, and osmosis bin format) remain independent until it matures and stabilises. If it has a wide enough audience (ie. useful to more than a small handful of people) then we can look at incorporating it into the core of Osmosis. At that point it will need to meet the current Osmosis code quality checks (eg. checkstyle), have unit tests, and pass a code review. None of that is too difficult, but I need to ensure it doesn't add a maintenance burden. I agree wholeheartedly with letting it evolve and I'm interested in hearing your (and others') thoughts on what additional features to include or exclude. I can possibly help with some ant and ivy advice in terms of how to incorporate it with Osmosis. It should be possible to make the existing osmosis library an Ivy dependency of your library. It might make sense to split your library in two parts, the generic re-usable code, and the Osmosis specific tasks. But I'm only speculating until I understand what you've done. Thanks, I'd appreciate that advice. The code is already segregated into osmosis independent and osmosis dependent packages; the independent code is already reused in my patches to the mkgmap splitter to read the binary files. Longer term I'm open to suggestions on how Osmosis should incorporate new features like this. It may actually make more sense to pull some existing features out of Osmosis and make them plugins also, or it may be simpler to just keep adding to the core. But these questions are more suited to the osmosis-dev list. You've got a nice architecture; it was very easy to figure out how to get my code to tie-in to your design, at least as a built in. Making it work as a plugin will be more effort. 
My biggest wish is that there should be a common library used by the different OSM software for representing low-level concepts such as points, bounding boxes, entities, etc, or at least define a baseline interface that the different OSM software offers for accessing their different entity implementations. I could target that. mkgmap, osmosis, the splitter, and josm all have 'Node' and 'Way' defined in semantically identical fashions, yet are incompatible types. To support all 4 of them requires reimplementing the binary serializer and deserializer 4 times, quadrupling the chances of having a bug. Such an interface would make it possible to have a shared XML-entity parser/writer implementation and make code more reusable across these applications. For instance, I'm interested in reusing josm's quadtree implementation in osmosis. I have a local patch that refactors it to support anything that has getBBox() and isUsable() functions, but using it requires importing josm's notion of bounding boxes and coordinates into osmosis along with refactoring out dependencies on java.awt.geom.Rectangle2D, and org.openstreetmap.josm.Main (via LatLon). Scott
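The baseline interface Scott describes (anything with getBBox() and isUsable() can go into a quadtree) needs only a tiny amount of shared code. A hypothetical sketch; the BBox type, its fields, and the interface name are assumptions for illustration, not an existing OSM library type:

```java
public class Spatial {
    // Minimal bounding-box type a shared baseline library could standardize on.
    public static final class BBox {
        final double minLat, minLon, maxLat, maxLon;

        public BBox(double minLat, double minLon, double maxLat, double maxLon) {
            this.minLat = minLat; this.minLon = minLon;
            this.maxLat = maxLat; this.maxLon = maxLon;
        }

        // True if the point falls inside this box (inclusive bounds).
        public boolean contains(double lat, double lon) {
            return lat >= minLat && lat <= maxLat
                && lon >= minLon && lon <= maxLon;
        }
    }

    // The baseline interface: just enough for spatial indexing code like a
    // quadtree, with nothing tool-specific pulled in.
    public interface HasBBox {
        BBox getBBox();
        boolean isUsable();
    }
}
```

Each tool's Node/Way types would implement HasBBox via thin adapters, so the quadtree (and a shared binary serializer) could be written once.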
Re: [OSM-dev] New OSM binary fileformat implementation.
On 05/01/2010 09:47 AM, Scott Crosby wrote: I agree wholeheartedly with letting it evolve and I'm interested in hearing your (and others') thoughts on what additional features to include or exclude. Some of these questions may be a bit premature, but I don't know how far along your design is, and perhaps asking them now may influence that design in ways that work for me. I'm developing an accessible map-browsing, GPS navigation app. You can read my initial blog post on the project here: http://thewordnerd.info/2010/03/introducing-hermes/ At the moment, this uses LibOSM from travelingsalesman and an as-yet-unreleased dataset using MongoDB for the geospatial queries. I don't really understand enough higher-level math to roll my own geospatial code, especially since I can't visually verify the results, so it's easier to use LibOSM and roll a dataset that I can run on a production site than it would be to re-invent the wheel. Unfortunately, this method introduces a variety of complications. First, the database for TX alone is 10 gigs. Ballpark estimations are that I might need half a TB or more to store the entire planet. I'll also need substantial RAM to store the working set for the DB index. All this means that, to launch this project on a global scale, I'd need a lot more funding than I as an individual am likely to find. I'm really excited to read your numbers for compression, because at first glance, this would seem to take the project from something that I'd need substantial EC2 infrastructure for, to something I can run on a mid-level VPS, slashing costs from $1000+/month to $50 or so/month. So my questions: Is there a performance or size penalty to ordering the data geographically rather than by ID? 
I understand that this won't be the default case, but I'm wondering if there would likely be any major performance issues for using it in situations where you're likely to want bounding-box access rather than simply pulling out entities by ID. Also, is there any reason that this format wouldn't be suitable for a site with many active users performing geographic, read-only queries of the data? Again, I'd guess not, since the data isn't compressed as such, but maybe seeking several gigs into a file to locate nearby entities would be a factor, or it may work just fine for single-user access but not so well with multiple distinct seeks for different users in widely separate locations. Anyhow, I realize these questions may be naive at such an early stage, but the idea that I may be able to pull this off without infrastructure beyond my budget is an appealing one. Are there any reasons your binary format wouldn't be able to accommodate this situation, or couldn't be optimized to do so? Thanks.
Re: [OSM-dev] New OSM binary fileformat implementation.
Sounds cool! What indices exist? Can I access elements by ID? What about access by bounding-box? Are there navigable back-references from Members to relations and from Nodes to Ways? Marcus
Re: [OSM-dev] New OSM binary fileformat implementation.
On Thu, Apr 29, 2010 at 2:15 AM, Frederik Ramm frede...@remote.org wrote: Scott, Scott Crosby wrote: I would like to announce code implementing a binary OSM format that supports the full semantics of the OSM XML. This all sounds very interesting, and you seem to have spent a lot of thought on it and documented it well. If I understand it correctly, this is meant to be a replacement for the XML files as a transport format for XML data. It is not meant to offer random access in any way, and thus differs from other attempts at creating binary formats that could be used in lieu of databases, having indexes and all. Roughly, yes, this is intended as a transport format, but the design is flexible. If the file is physically ordered so that blocks have strong geographic locality and block metadata includes bounding boxes, then those bounding boxes can be used to skip unneeded blocks. If the file is physically ordered in type/id order (as the current planet is), and block metadata includes the minimum and maximum id for each block, then, as before, the metadata can be used to only examine desired blocks. If both searches are critical, then generate two files, one with geographic locality and one sorted by type/id. Storing 'planet-omitmeta.bin' *AND* 'planet.bin' is still cheaper than storing 'planet-100303.osm.gz'. As you point out, a binary format means different things to different people. I chose this design because it would be useful as-is and it could offer future features without requiring changing the file format. Geographic searches and searches by type/id merely wait on implementing code for physically reordering the file to be written. Adding the appropriate fields to the metadata header is trivial in comparison. In addition, nothing in the design precludes adding fileblocks that contain data other than OSM entities. Fileblocks can contain an index from a node ID to the ways and relations it is contained in. 
Metadata headers on these blocks can indicate which block contains the index entries for a particular node. My vision is that in many cases it is better to have a simple format that is very dense and lets you skip the 95% of the data that you don't care about, rather than design a very complex or significantly larger format (e.g., a relational database) that lets you skip 99+% of the data that you don't care about. The more advanced formats may return less data, but the simple format is still 20 times less data than reading everything. Maybe we should be careful about naming these formats to make their purpose clearer. The generic OSM binary format seems to mean different things to different people. The file extension .bin is perhaps not the best choice. Have you considered/evaluated Fast Infoset and if so, what were the reasons against that? No, I was not aware of that compressed XML design. It is 5x-10x faster at reading and writing and 30-50% smaller The size figure is obviously compared to bz2; is the 5x-10x faster also compared to bz2, and if so, compared to the native Java bz2 or the external C one? For filesizes, I was comparing to bzip2. For performance, I compared against the gzip'ed planet; I didn't have the patience to compress or decompress that much XML in bzip2. an entire planet, including all metadata, can be read in about 12 minutes and written in about 50 minutes on a 3 year old dual-core machine. How did you measure write performance decoupled from read performance? Surely your 3 year old dual-core machine did not have the 150 gigs of RAM needed to suck the entire planet into memory? I benchmarked:
  osmosis --read-bin file=planet.bin --write-null
  osmosis --read-bin file=planet.bin --write-bin file=planet2.bin
And measured ~12 minutes of CPU time for the first and ~60 minutes of CPU time for the second. With a dual-core system, using '--b bufferCapacity=2' gives some concurrency and writing can be done in somewhere around 40 minutes. 
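The "skip 95% of the blocks" vision reduces to a bounding-box intersection test against per-block metadata. A sketch under assumed names (BlockMeta and its fields are hypothetical; the real metadata fields would be defined in fileformat.proto):

```java
public class BlockIndex {
    // Hypothetical per-block metadata: file offset plus a bounding box
    // covering every entity serialized in that block.
    public static final class BlockMeta {
        final long offset;
        final double minLat, minLon, maxLat, maxLon;

        public BlockMeta(long offset, double minLat, double minLon,
                         double maxLat, double maxLon) {
            this.offset = offset;
            this.minLat = minLat; this.minLon = minLon;
            this.maxLat = maxLat; this.maxLon = maxLon;
        }
    }

    // A block must be decompressed and examined only if its bbox intersects
    // the query bbox; every other block is skipped without parsing it.
    public static boolean mustRead(BlockMeta b,
                                   double qMinLat, double qMinLon,
                                   double qMaxLat, double qMaxLon) {
        return b.minLat <= qMaxLat && b.maxLat >= qMinLat
            && b.minLon <= qMaxLon && b.maxLon >= qMinLon;
    }
}
```

This also makes the cross-continental-way concern concrete: a block whose bbox spans a continent passes this test for virtually any query, so it can never be skipped.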
You have paid an impressive amount of attention to detail in order to achieve the good performance and compression rates that you do. I'm slightly concerned about the robustness of it all - in the past, we often had planet files that were broken one way or the other, and it was usually possible to remedy this with some standard grep, sed, or dd actions - if one of your files ever breaks then I guess it is likely to be complete garbage ;-) Without knowing how those prior planets were broken, I can't say whether analogous breakage of files in my format could be repaired. Probably the most important TODO is packaging and fixing the build system. I have almost no experience with ant and am unfamiliar with java packaging practices, so I'd like to request help/advice on ant and suggestions on how to package the common parsing/serializing code so that it can be re-used across different programs. I suggest to ask on osmosis-dev, and get your new code into the Osmosis
Re: [OSM-dev] New OSM binary fileformat implementation.
On Thu, Apr 29, 2010 at 2:16 AM, jamesmikedup...@googlemail.com jamesmikedup...@googlemail.com wrote: HI, I am working on the c++ decoder, can you please tell me roughly where the message objects are? (CC'ing back to the list) I need to understand how to decode this file. Do you have any header information or is it all raw protobuf? You mentioned blocks being compressed, are you using the protobuf tools to do this? Can you give me a rough layout of how to parse the bin file? Parsing a protocol buffer requires knowing its length beforehand, which adds in some complexities. Currently a file consists of repetitions of the following:
- a 32-bit integer encoding the header length (I believe it is in Java's default big-endian order)
- a serialized FileBlockHeader message (see fileformat.proto)
- a serialized Blob message (see fileformat.proto; the length is given in the header message)
Code implementing this is in BlockInputStream and BlockOutputStream and FileBlock in package crosby.binary.file. To understand how compression is used, look at the definition of the Blob message and FileBlock.java Inside the blob is a serialized osm HeaderBlock or osm PrimitiveBlock message. Note that I am currently unhappy with the current file header scheme and may make incompatible changes. There's definitely some cruft in fileformat.proto that needs to be removed. thanks, mike
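The framing layer described above can be sketched on its own: a 4-byte big-endian length, then that many bytes of serialized header, then the blob. Java's DataOutputStream/DataInputStream are big-endian, matching the description. This sketch covers only the length-prefix framing; protobuf parsing of the FileBlockHeader and Blob is deliberately omitted:

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class Framing {
    // Emit one frame: 4-byte big-endian length, then the header bytes.
    public static void writeFrame(DataOutputStream out, byte[] header)
            throws IOException {
        out.writeInt(header.length);   // DataOutputStream writes big-endian
        out.write(header);
    }

    // Read one frame back: length prefix first, then exactly that many bytes.
    public static byte[] readFrame(DataInputStream in) throws IOException {
        int len = in.readInt();        // 32-bit big-endian header length
        byte[] header = new byte[len];
        in.readFully(header);
        return header;                 // would next be parsed as FileBlockHeader
    }
}
```

A C++ decoder has to take care to byte-swap the length on little-endian machines before sizing its read buffer.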
Re: [OSM-dev] New OSM binary fileformat implementation.
On Thu, Apr 29, 2010 at 10:15 PM, Scott Crosby scrosb...@gmail.com wrote: Probably the most important TODO is packaging and fixing the build system. I have almost no experience with ant and am unfamiliar with java packaging practices, so I'd like to request help/advice on ant and suggestions on how to package the common parsing/serializing code so that it can be re-used across different programs. I suggest to ask on osmosis-dev, and get your new code into the Osmosis trunk quickly so people can play with it. I think it would be prudent to get suggestions from the OSM community first. Once the code is in osmosis, our ability to make compatibility-breaking changes to the format will be reduced. Hi Scott, Thanks for all the great work. It will be a little while (at least several days) before I can take a look at what you've done. Initial thoughts from my end (as main developer of Osmosis) is that it would be best to keep it separate until you've let it evolve somewhat. If you can develop it as a plugin for now it will let us (ie. Osmosis core, and osmosis bin format) remain independent until it matures and stabilises. If it has a wide enough audience (ie. useful to more than a small handful of people) then we can look at incorporating it into the core of Osmosis. At that point it will need to meet the current Osmosis code quality checks (eg. checkstyle), have unit tests, and pass a code review. None of that is too difficult, but I need to ensure it doesn't add a maintenance burden. I can possibly help with some ant and ivy advice in terms of how to incorporate it with Osmosis. It should be possible to make the existing osmosis library an Ivy dependency of your library. It might make sense to split your library in two parts, the generic re-usable code, and the Osmosis specific tasks. But I'm only speculating until I understand what you've done. Longer term I'm open to suggestions on how Osmosis should incorporate new features like this. 
It may actually make more sense to pull some existing features out of Osmosis and make them plugins also, or it may be simpler to just keep adding to the core. But these questions are more suited to the osmosis-dev list. Brett
Re: [OSM-dev] New OSM binary fileformat implementation.
On Wed, Apr 28, 2010 at 7:02 PM, Scott Crosby scrosb...@gmail.com wrote: Hello! I would like to announce code implementing a binary OSM format that supports the full semantics of the OSM XML. It is 5x-10x faster at reading and writing and 30-50% smaller; an entire planet, including all metadata, can be read in about 12 minutes and written in about 50 minutes on a 3 year old dual-core machine. I have implemented an osmosis reader and writer and have enhancements to the map splitter to read the format. Code is pure Java and uses Google protocol buffers for the low-level serialization. That's very interesting. Would like to see this working with c++ as well. Will have to look at the code. thanks for sharing, mike
Re: [OSM-dev] New OSM binary fileformat implementation.
On Wed, Apr 28, 2010 at 1:16 PM, OJ W ojwli...@googlemail.com wrote: where's the .proto file? Proto files should be in the osmosis git repository at: src/crosby/binary/fileformat.proto src/crosby/binary/osmformat.proto do you have data files in this format available to download? No, I do not have any files at this time as I am not ready to declare the file format as being stable. You can make your own test files with --write-bin. Scott
Re: [OSM-dev] New OSM binary fileformat implementation.
I have a quick question: does the format support inserting data into an existing file? Or is it just some binary serialization? The format is a binary serialization design and does not support random-access read and write semantics. For that a database is probably more suitable. However, some changes can be done to a file relatively cheaply. Data can be trivially appended. Rewriting a file could be fairly cheap as each fileblock is independently decodable and contains only 8,000 OSM entities. A fileblock can be copied from an input to an output without decompressing or parsing it. Metadata in the block header could be used to find out which fileblocks can be copied unchanged, or used to filter out unwanted blocks. Now we just need the dump tool for the database to create some planet dump file in your format. If osmosis is used as the dump tool, I believe a --write-bin should suffice to make a planet dump. The code just ties into the existing Source/Sink osmosis architecture. Scott
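Copying a fileblock without decompressing or parsing it amounts to shuttling the raw frame bytes from input to output. A sketch under one simplifying assumption: the caller already knows the blob length (in the actual format it is declared inside the protobuf header, which this sketch does not parse):

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class BlockCopy {
    // Copy one fileblock verbatim: no decompression, no protobuf parsing.
    // blobLen is assumed known; a real tool would read it from the header.
    public static void copyBlock(DataInputStream in, DataOutputStream out,
                                 int blobLen) throws IOException {
        int headerLen = in.readInt();       // 4-byte big-endian length prefix
        byte[] header = new byte[headerLen];
        in.readFully(header);
        byte[] blob = new byte[blobLen];
        in.readFully(blob);
        out.writeInt(headerLen);            // re-emit the frame unchanged
        out.write(header);
        out.write(blob);
    }
}
```

A filter tool would parse only the small header to decide keep-or-drop, then either copy the blob bytes untouched or skip past them, paying decompression cost for none of the blocks it passes through.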
Re: [OSM-dev] New OSM binary fileformat implementation.
On 28/04/10 20:07, Scott Crosby wrote: Now we just need the dump tool for the database to create some planet dump file in your format. If osmosis is used as the dump tool, I believe a --write-bin should suffice to make a planet dump. The code just ties into the existing Source/Sink osmosis architecture. Osmosis isn't the dump tool, no. The planetdump program is. If we were going to offer a binary version for download then it would be better to generate it from the xml one anyway, rather than from the database, so that the two versions are consistent. Tom -- Tom Hughes (t...@compton.nu) http://compton.nu/
Re: [OSM-dev] New OSM binary fileformat implementation.
On 28-04-10 21:50, Tom Hughes wrote: Osmosis isn't the dump tool, no. The planetdump program is. If we were going to offer a binary version for download it would be better to generate it from the xml one anyway, rather than from the database, so that the two versions are consistent. Since the current XML version is inconsistent, a direct database dump will be more consistent than any conversion. Matt received a lot of examples for inconsistencies already. Stefan
Re: [OSM-dev] New OSM binary fileformat implementation.
On 28 April 2010 21:04, Stefan de Konink ste...@konink.de wrote: Since the current XML version is inconsistent, a direct database dump will be more consistent than any conversion. Matt received a lot of examples for inconsistencies already. Since: http://trac.openstreetmap.org/changeset/20396 ? / Grant
Re: [OSM-dev] New OSM binary fileformat implementation.
On 28-04-10 22:11, Grant Slater wrote: On 28 April 2010 21:04, Stefan de Konink ste...@konink.de wrote: Since the current XML version is inconsistent, a direct database dump will be more consistent than any conversion. Matt received a lot of examples for inconsistencies already. Since: http://trac.openstreetmap.org/changeset/20396 ? Did that fix *all* the older inconsistencies? I mean did you run a query to verify all referential constraints on the current table? Stefan
Re: [OSM-dev] New OSM binary fileformat implementation.
I got it built. Build instructions: first get protobuf and build it manually:

svn checkout http://protobuf.googlecode.com/svn/trunk/ protobuf-read-only
cd protobuf-read-only/
bash ./autogen.sh
./configure
make
sudo make install

Then go into the java dir and use ant to build the jar files. Next, go to the sources of the OSM tool and install the protobuf jar into lib/compile; normally this would be handled by ivy, but there are issues:

cp /home/mdupont/experiments/osm/protobuf-read-only/java/target/protobuf-java-2.3.1-pre.jar lib/compile/

Then generate the Java sources from the .proto files:

cd src/crosby/binary/
protoc --java_out=../.. fileformat.proto
protoc --java_out=../.. osmformat.proto

This generates the needed code; now you can build.

On Wed, Apr 28, 2010 at 7:02 PM, Scott Crosby scrosb...@gmail.com wrote: Hello! I would like to announce code implementing a binary OSM format that supports the full semantics of the OSM XML. It is 5x-10x faster at reading and writing and 30-50% smaller; an entire planet, including all metadata, can be read in about 12 minutes and written in about 50 minutes on a 3-year-old dual-core machine. I have implemented an osmosis reader and writer and have enhancements to the map splitter to read the format. The code is pure Java and uses Google protocol buffers for the low-level serialization. Comparing the file sizes:

8.2gb planet-100303.osm.bz2
12 gb planet-100303.osm.gz
5.2gb planet-omitmeta.bin
6.2gb planet.bin

The omitmeta version omits the uid/user/version/timestamp metadata fields on each entity and is faster to generate and read. The design is very extensible. The low-level file format is designed to support random access at the 'fileblock' granularity, where a fileblock can contain ~8k OSM entities. There is *no* tag hardcoding used; all keys and values are stored in full as opaque strings. For future scalability, 64-bit node/way/relation IDs are assumed.
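The announcement mentions that the low-level format supports random access at the 'fileblock' granularity. As a rough sketch of what a length-delimited block container can look like (the framing shown here, a 4-byte big-endian length prefix per block, is my illustration only, not the actual spec; see fileformat.proto for the real layout):

```python
# Hypothetical sketch of a length-prefixed block stream: each block is a
# 4-byte big-endian length followed by that many payload bytes.  Random
# access is possible by skipping whole blocks without decoding them.
import io
import struct

def write_block(out, payload: bytes) -> None:
    out.write(struct.pack(">I", len(payload)))  # length prefix
    out.write(payload)

def read_blocks(inp):
    while True:
        header = inp.read(4)
        if not header:
            return  # clean end of stream
        (length,) = struct.unpack(">I", header)
        yield inp.read(length)

buf = io.BytesIO()
write_block(buf, b"block-1")
write_block(buf, b"block-2")
buf.seek(0)
assert list(read_blocks(buf)) == [b"block-1", b"block-2"]
```

In the real format each payload would be a serialized protobuf message rather than raw bytes; the point is only that fixed framing lets a reader seek past blocks it does not need.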
The current serializer preserves the order of OSM entities and of the tags on OSM entities. To flexibly handle multiple resolutions, the granularity (the resolution used for representing locations and timestamps) is adjustable in multiples of 1 millisecond and 1 nanodegree, and can be set independently for each fileblock. The default scaling factors are 1000 milliseconds and 100 nanodegrees, the latter corresponding to about 1cm at the equator. These are the current resolutions of the OSM database. Smaller files can be generated: at 10 microdegrees granularity, corresponding to about 1m of resolution, the filesize decreases by about 1gb. Space may also be saved by removing uninteresting UUID tags, or perhaps by ensuring stronger geographic locality when building the file. I have also tested the binary format on some SRTM contour lines in OSM 0.5 XML format, obtaining about a 50:1 compression ratio. This might be further improved by choosing a granularity equal to the isohypse grid size.

// Testing

I have tested this code on the Cloudmade extract of Rhode Island. After converting the entire file to and from binary format, the XML output is bytewise identical to the original file except for the one line indicating the osmosis version number. When run through the splitter, the output is not bytewise identical to before because of round-off errors 16 digits after the decimal point; this could be fixed by having the splitter behave like osmosis and only output 7 significant digits.

// To use:

Demonstration code is available on github at http://github.com/scrosby/OSM-Osmosis and http://github.com/scrosby/OSM-splitter (see the 'master' branches). Please note that this is at present unpackaged demonstration code, and the fileformat may change to incorporate suggestions. Also note that the shared code between the splitter and osmosis currently lives in the osmosis git repository.
You'll also need to go into the crosby.binary directory and run the protocol compiler ('protoc') on the .proto files (see the comments in those files for the command line).

/// The design ///

I use Google protocol buffers for the low-level store. Given a specification file of one or more messages, the protocol buffer compiler writes low-level serialization code. Messages may contain other messages, forming hierarchical structures. Protocol buffers also support extensibility: new fields can be added to a message, and old clients can read those messages without recompiling. For more details, please see http://code.google.com/p/protobuf/. Google officially supports C++, Java, and Python, but compilers exist for other languages. An example message specification is:

message Node {
   required sint64 id = 1;
   required sint64 lat = 7;
   required sint64 lon = 8;
   repeated uint32 keys = 9 [packed = true];  // Denote strings
   repeated uint32 vals = 10 [packed = true]; // Denote strings
}
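The `keys`/`vals` fields in the Node message above are integer indexes into a string table rather than inline strings, so a key that recurs thousands of times, like "highway", is stored only once. A minimal Python sketch of the lookup (the table contents here are made up for illustration):

```python
# Illustrative string-table tag lookup: each uint32 in keys/vals indexes a
# shared table of strings, and tags are recovered by pairing them up.
string_table = ["", "highway", "residential", "name", "Main Street"]

node_keys = [1, 3]  # indexes into string_table
node_vals = [2, 4]

tags = {string_table[k]: string_table[v] for k, v in zip(node_keys, node_vals)}
assert tags == {"highway": "residential", "name": "Main Street"}
```

This is why the format needs no tag hardcoding: the table is built per file (or per block), so any key/value vocabulary compresses equally well.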
Re: [OSM-dev] New OSM binary fileformat implementation.
I have created a branch with the cpp generated code: g...@github.com:h4ck3rm1k3/OSM-Osmosis.git. The new C++ lib is called libosmprotobuf, what a great name. In OSM-Osmosis/src/crosby/binary/, run make to generate the code (but I checked in the results). In the subdir OSM-Osmosis/src/crosby/binary/cpp:

bash ./autogen.sh
./configure
make

It uses /usr/local/lib/libprotobuf.la, which is a bit of a hack. My next step will be to hook this up to my existing C++ code: https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg I hope that I will be able to make nice small C++ tools that can process and/or produce these buffer files. Very interesting stuff that Google has produced. Thanks, Scott, for making this public. mike
Re: [OSM-dev] New OSM binary fileformat implementation.
On 28 April 2010 21:27, Stefan de Konink ste...@konink.de wrote: Since the current XML version is inconsistent, a direct database dump will be more consistent than any conversion. Matt received a lot of examples for inconsistencies already. Since: http://trac.openstreetmap.org/changeset/20396 ? Did that fix *all* the older inconsistencies? I mean, did you run a query to verify all referential constraints on the current tables? Well, if there are any inconsistencies, don't keep them secret. Let's get them fixed. :-) / Grant