Re: [OSM-dev] New OSM binary fileformat implementation.

2010-09-17 Thread Frederik Ramm

Mike,

jamesmikedup...@googlemail.com wrote:

Hi, is there any documentation of the binary format changes?
I have implemented a C++ reader using protobuf and would update it if
there is a new format spec.


It would be great if you could check whether your reader still works 
with the current implementation; I'd then be extremely grateful for 
some sort of minimal package that contains only your reader and the 
stuff absolutely necessary to build it. I've checked out your 
http://github.com/h4ck3rm1k3/OSM-Osmosis but ended up with a tree that 
contained half (or all?) of Osmosis and lots of autoconf cruft, and it 
wasn't buildable for me because it expected the Google protobuf stuff to 
be downloaded and installed separately and I didn't know what to get and 
where to install it!


Background is, I would like to add binary format support to osm2pgsql 
and was hoping to be able to use your code for that.


Scott, it would be great if, apart from a name for the new binary format, 
you'd also recommend a default file extension, since some of our tools 
try to auto-detect the file format from the name (.osm.bz2, .osm.gz, 
.osm - maybe .osm.bin for the new stuff?).
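
For illustration, such extension-based auto-detection might look like the
following sketch; the class, method, and format labels are hypothetical,
not taken from any existing tool:

    class FormatSniffer {
        // Map the extensions named above to a format label. Hypothetical.
        static String detectFormat(String fileName) {
            if (fileName.endsWith(".osm.bz2")) return "xml+bzip2";
            if (fileName.endsWith(".osm.gz"))  return "xml+gzip";
            if (fileName.endsWith(".osm.bin")) return "binary"; // proposed extension
            if (fileName.endsWith(".osm"))     return "xml";    // must be tested last
            throw new IllegalArgumentException("Unrecognized file name: " + fileName);
        }
    }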


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-09 Thread Scott Crosby
On Wed, Apr 28, 2010 at 12:02 PM, Scott Crosby scrosb...@gmail.com wrote:
 Hello!

 I would like to announce code implementing a binary OSM format that
 supports the full semantics of the OSM XML. It is 5x-10x faster at
 reading and writing and 30-50% smaller; an entire planet, including
 all metadata, can be read in about 12 minutes and written in about 50
 minutes on a 3 year old dual-core machine. I have implemented an
 osmosis reader and writer and have enhancements to the map splitter to
 read the format. Code is pure Java and uses Google protocol buffers
 for the low-level serialization.

 Comparing the file sizes:

  8.2 GB   planet-100303.osm.bz2
 12   GB   planet-100303.osm.gz
  5.2 GB   planet-omitmeta.bin
  6.2 GB   planet.bin


Some newer results.

I have a modification to dense nodes to support storing tags. This
makes the files for an entire planet about 500 MB smaller. Sizes are now:

  4.7 GB   planet-omitmeta.bin
  5.7 GB   planet.bin

Results when dropping the resolution to 1 m precision:
  3.8 GB   planet-granularity=1-omitmeta.bin

This reduced resolution format may be a good choice for distributing
OSM snapshots to non-editors.
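
To make the granularity idea concrete, here is a minimal sketch of the
quantization arithmetic, assuming coordinates are stored as integer
multiples of a granularity given in nanodegrees; the figures mirror ones
quoted in this thread (~1 cm default, ~1 m at 10 microdegrees), and the
names are illustrative:

    class Granularity {
        // Quantize a coordinate in degrees to integer units of the given
        // granularity (nanodegrees per unit). 100 nanodegrees is roughly
        // 1 cm of latitude; 10,000 nanodegrees (10 microdegrees) roughly 1 m.
        static long quantize(double degrees, long granularityNanodeg) {
            return Math.round(degrees * 1000000000L / granularityNanodeg);
        }

        static double dequantize(long units, long granularityNanodeg) {
            return units * granularityNanodeg / 1000000000.0;
        }
    }

Coarser granularity makes consecutive coordinate deltas smaller, which is
where size savings of the kind quoted above come from.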

I have tested the new format on the CloudMade extract of rhode_island,
converting 122 MB of uncompressed XML to and from the binary format. The
result is bytewise identical to the source file except for the osmosis
version number at the top.

Scott

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-09 Thread Scott Crosby
On Wed, Aug 4, 2010 at 7:17 PM, Brett Henderson br...@bretth.com wrote:
 On Tue, Aug 3, 2010 at 11:37 PM, Scott Crosby scro...@cs.rice.edu wrote:

 On Sun, Aug 1, 2010 at 6:39 AM, Brett Henderson br...@bretth.com wrote:


 If we go down this path I need two things:

 1. A versioned jar file containing all re-usable code.  Scott, can you take
 care of this?

I have split off the reusable code into a separate library, distinct
from the osmosis-only code, which is currently sitting in my osmosis
git repository (published to GitHub). I have created a git repo for
the reusable code at http://github.com/scrosby/OSM-binary. Note that
the history is messy, so I will be rebasing that repository.

How do I configure this project to build correctly and produce a
versioned jar file?

 2. The Osmosis specific code that I can use in a new osmbin project within
 the Osmosis Subversion repo.  I can probably get them from GIT if you let me
 know which files I need.

I have a working version of the plugin published to my osmosis github
mirror. I duct-taped it together by hacking osmosis_plugins.conf, but
the binary plugin is working on trunk.

Scott

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-09 Thread Scott Crosby
On Sun, Aug 1, 2010 at 5:51 AM, jamesmikedup...@googlemail.com wrote:
 Hi, is there any documentation of the binary format changes?
 I have implemented a C++ reader using protobuf and would update it if
 there is a new format spec.
 mike

No real docs. There are some tweaks around the edges, mostly renaming
protocol buffer message and field names; a search-and-replace covers most
of it. Field numbers may have changed, so the earliest files are not
compatible.

There are a few semantic differences that affect parsing. Offsets are
now encoded in a way that lets the grid in the binary file be aligned
with the regular grid of a dataset, such as isohypse (contour) files.
The other notable change is that I have extended DenseNodes to support
tags and have removed the former Node message. As a result, '0' is no
longer available for use as a string identifier; it is used as a
delimiter.
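
To illustrate the delimiter role of index 0, here is a hedged sketch of
walking a packed key/value stream for DenseNodes; the array layout follows
the description above, but the names are illustrative rather than the
actual generated protobuf API:

    class DenseTagDecoder {
        // Each node's tags appear as (keyIndex, valueIndex) pairs into a
        // shared string table; a 0 ends that node's tag list, so index 0
        // can no longer name a real string.
        static void decode(int[] keysVals, String[] stringTable, int nodeCount) {
            int pos = 0;
            for (int node = 0; node < nodeCount; node++) {
                while (pos < keysVals.length && keysVals[pos] != 0) {
                    String key = stringTable[keysVals[pos++]];
                    String value = stringTable[keysVals[pos++]];
                    System.out.println("node " + node + ": " + key + "=" + value);
                }
                pos++; // skip the 0 delimiter between nodes
            }
        }
    }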

In a week or two, depending on feedback and any resulting changes, I
will upload reference files to github. If you update the reader, I
would appreciate a copy.

Thanks,
Scott

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-04 Thread Brett Henderson
On Tue, Aug 3, 2010 at 11:37 PM, Scott Crosby scro...@cs.rice.edu wrote:

 On Sun, Aug 1, 2010 at 6:39 AM, Brett Henderson br...@bretth.com wrote:

  I'll help incorporate this into the rest of Osmosis.  There's a few
 things
  to work through though.

 I don't have a lot of time to work with this, but I can split up my
 working branch (which includes several unrelated changes) into
 separate orthogonal pieces. Git is *VERY* good at this. That would
 simplify integration.

 
  Is there a demand for the binary format in its current incarnation?  I'm
 not keen to incorporate it if nobody will use it.

 I think it would be used in the mkgmap splitter, if available.

  Can the code be managed in the main OSM Subversion repo instead of GIT?

 Yes. I use git personally, but there's very good SVN integration.

  Is any code reuse between Osmosis and other applications required?

 Yes.

   The *.proto files must be shared with other applications that use
 the binary format, including C/Java/Python/.net/

I wrote some Java parser code, in crosby/binary/file and
 crosby/binary/*.java, that is intended to be shared across the other
 Java osmosis applications (e.g., I'm using it in my splitter changes).

I suggest that all of this be put in a separate library along with
 jamesmikedupont's C/C++ code.


Currently Osmosis is split into a number of sub-projects.  For example,
there's xml, apidb, pgsql, etc.  This would be a new project, something like
osmbin although that's a fairly generic name.  But presumably we'd only be
putting the Osmosis specific stuff in there.  The osmbin project would need
to have a dependency on an external lib that contains your re-usable code.
That is the tricky bit.

Osmosis currently retrieves external dependencies from the public Maven
repository at repo1.maven.org.  The few libraries that aren't available
there are checked directly into the Osmosis-managed repository stored in
the build-support/repo directory.

The simplest way to solve this is for you to create your third-party library
through whatever means you wish; we then check the resultant (properly
versioned) jar file into the Osmosis build-support/repo Ivy repository.  The
osmbin project can then pull that lib in as a dependency and do a build.

If we go down this path I need two things:
1. A versioned jar file containing all re-usable code.  Scott, can you take
care of this?
2. The Osmosis specific code that I can use in a new osmbin project within
the Osmosis Subversion repo.  I can probably get them from GIT if you let me
know which files I need.

Brett
___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-03 Thread Scott Crosby
On Sat, Jul 31, 2010 at 11:26 AM, Frederik Ramm frede...@remote.org wrote:
 Scott, others,

 Scott Crosby wrote:

 I would like to announce code implementing a binary OSM format that
 supports the full semantics of the OSM XML.

 [...]

 The changes to osmosis are just some new tasks to handle reading and
 writing the binary format.

 [...]

 This was 3 months ago.

 What's the status of this project? Are people actively using it? Is it still
 being developed? Can the Osmosis tasks be used in the new Osmosis code
 architecture (see over on osmosis-dev) that Brett has introduced with 0.36?

I'm using it personally. I know of no other users, except that Nolan
Darilek is interested in whether the format can be expanded with
geographic indexing information. I have a few minor tweaks that I've
been intending to make before declaring the format final - basically,
defining some optional fileformat fields (e.g., is the file sorted,
and on what parameter?). There's no infrastructure using these fields,
however.

How much interest is there in this code and format?

Scott

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-03 Thread Scott Crosby
On Sun, Aug 1, 2010 at 6:39 AM, Brett Henderson br...@bretth.com wrote:

 I'll help incorporate this into the rest of Osmosis.  There's a few things
 to work through though.

I don't have a lot of time to work with this, but I can split up my
working branch (which includes several unrelated changes) into
separate orthogonal pieces. Git is *VERY* good at this. That would
simplify integration.


 Is there a demand for the binary format in its current incarnation?  I'm not
 keen to incorporate it if nobody will use it.

I think it would be used in the mkgmap splitter, if available.

 Can the code be managed in the main OSM Subversion repo instead of GIT?

Yes. I use git personally, but there's very good SVN integration.

 Is any code reuse between Osmosis and other applications required?

Yes.

   The *.proto files must be shared with other applications that use
the binary format, including C/Java/Python/.net/

I wrote some Java parser code, in crosby/binary/file and
crosby/binary/*.java, that is intended to be shared across the other
Java osmosis applications (e.g., I'm using it in my splitter changes).

I suggest that all of this be put in a separate library along with
jamesmikedupont's C/C++ code.



Scott

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread Erik Johansson
On Sun, Aug 1, 2010 at 2:35 AM, Brett Henderson br...@bretth.com wrote:
 On Sun, Aug 1, 2010 at 2:26 AM, Frederik Ramm frede...@remote.org wrote:

 Scott, others,

 Scott Crosby wrote:

 I would like to announce code implementing a binary OSM format that
 supports the full semantics of the OSM XML.

 [...]

 The changes to osmosis are just some new tasks to handle reading and
 writing the binary format.

 [...]

 This was 3 months ago.

 What's the status of this project? Are people actively using it? Is it
 still being developed? Can the Osmosis tasks be used in the new Osmosis code
 architecture (see over on osmosis-dev) that Brett has introduced with 0.36?

 I'm curious about this as well.  The main reason for me introducing the new
 project structure was to facilitate the integration of new features like
 this.  They're relatively easy to add (some Ant and Ivy foo required ...),
[...]
 The code hasn't changed a lot, but the build processes have.


Well, that's one of the things Scott said he had no clue how to do.
From Scott's mail:



Scott Crosby:
 // TODO's

 Probably the most important TODO is packaging and fixing the build system.
 I have almost no experience with ant and am unfamiliar with java
 packaging practices, so I'd like to request help/advice on ant and 
 suggestions on
 how to package the common parsing/serializing code so that it can be
 re-used across different programs.

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread jamesmikedup...@googlemail.com
Hi, is there any documentation of the binary format changes?
I have implemented a C++ reader using protobuf and would update it if
there is a new format spec.
mike

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread Brett Henderson
On Sun, Aug 1, 2010 at 7:34 PM, Erik Johansson erjo...@gmail.com wrote:

 On Sun, Aug 1, 2010 at 2:35 AM, Brett Henderson br...@bretth.com wrote:
  On Sun, Aug 1, 2010 at 2:26 AM, Frederik Ramm frede...@remote.org
 wrote:
 
  Scott, others,
 
  Scott Crosby wrote:
 
  I would like to announce code implementing a binary OSM format that
  supports the full semantics of the OSM XML.
 
  [...]
 
  The changes to osmosis are just some new tasks to handle reading and
  writing the binary format.
 
  [...]
 
  This was 3 months ago.
 
  What's the status of this project? Are people actively using it? Is it
  still being developed? Can the Osmosis tasks be used in the new Osmosis
 code
  architecture (see over on osmosis-dev) that Brett has introduced with
 0.36?
 
  I'm curious about this as well.  The main reason for me introducing the
 new
  project structure was to facilitate the integration of new features like
  this.  They're relatively easy to add (some Ant and Ivy foo required
 ...),
 [...]
  The code hasn't changed a lot, but the build processes have.


 Well, that's one of the things Scott said he had no clue how to do.
 From Scott's mail:



 Scott Crosby:
  // TODO's

  Probably the most important TODO is packaging and fixing the build
 system.
  I have almost no experience with ant and am unfamiliar with java
  packaging practices, so I'd like to request help/advice on ant and
 suggestions on
  how to package the common parsing/serializing code so that it can be
  re-used across different programs.


I'll help incorporate this into the rest of Osmosis.  There's a few things
to work through though.

   - Is there a demand for the binary format in its current incarnation?
   I'm not keen to incorporate it if nobody will use it.
   - Can the code be managed in the main OSM Subversion repo instead of GIT?
   - Is any code reuse between Osmosis and other applications required?  If
   only the Osmosis tasks will be managed in the Osmosis project and a
   component with common functionality managed elsewhere then I need to know
   how the common component will be managed and published for consumption in
   Osmosis.

Brett
___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread Andreas Kalsch
What about some metrics (performance, size)? The data is the same whether 
it is binary or not, so binary really has to pay off significantly.


On 01.08.10 13:39, Brett Henderson wrote:
On Sun, Aug 1, 2010 at 7:34 PM, Erik Johansson erjo...@gmail.com wrote:


On Sun, Aug 1, 2010 at 2:35 AM, Brett Henderson br...@bretth.com wrote:
 On Sun, Aug 1, 2010 at 2:26 AM, Frederik Ramm frede...@remote.org wrote:

 Scott, others,

 Scott Crosby wrote:

 I would like to announce code implementing a binary OSM format
that
 supports the full semantics of the OSM XML.

 [...]

 The changes to osmosis are just some new tasks to handle
reading and
 writing the binary format.

 [...]

 This was 3 months ago.

 What's the status of this project? Are people actively using
it? Is it
 still being developed? Can the Osmosis tasks be used in the new
Osmosis code
 architecture (see over on osmosis-dev) that Brett has
introduced with 0.36?

 I'm curious about this as well.  The main reason for me
introducing the new
 project structure was to facilitate the integration of new
features like
 this.  They're relatively easy to add (some Ant and Ivy foo
required ...),
[...]
 The code hasn't changed a lot, but the build processes have.


Well, that's one of the things Scott said he had no clue how to do.
From Scott's mail:



Scott Crosby:
 // TODO's

 Probably the most important TODO is packaging and fixing the
build system.
 I have almost no experience with ant and am unfamiliar with java
 packaging practices, so I'd like to request help/advice on ant
and suggestions on
 how to package the common parsing/serializing code so that it can be
 re-used across different programs.


I'll help incorporate this into the rest of Osmosis.  There's a few 
things to work through though.


* Is there a demand for the binary format in its current
  incarnation?  I'm not keen to incorporate it if nobody will use it.
* Can the code be managed in the main OSM Subversion repo instead
  of GIT?
* Is any code reuse between Osmosis and other applications
  required?  If only the Osmosis tasks will be managed in the
  Osmosis project and a component with common functionality
  managed elsewhere then I need to know how the common component
  will be managed and published for consumption in Osmosis.

Brett




___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread Ævar Arnfjörð Bjarmason
On Sun, Aug 1, 2010 at 17:33, Andreas Kalsch andreaskal...@gmx.de wrote:
 What about some metrics (performance, size)? The data is the same whether
 it is binary or not, so binary really has to pay off significantly.

What performance metrics would you like that haven't already been
covered earlier in this thread and in the initial announcement?

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread Frederik Ramm

Hi,

Brett Henderson wrote:
I'll help incorporate this into the rest of Osmosis.  There's a few 
things to work through though.


* Is there a demand for the binary format in its current
  incarnation?  I'm not keen to incorporate it if nobody will use it.


I run a nightly job at Geofabrik which currently operates on plain 
(uncompressed) OSM files and goes roughly like this (every step uses 
Osmosis):


* apply daily diff to planet file
* split planet file into continents
* split each continent into countries
* split some countries into smaller units
* split some smaller units into even smaller units
* bzip2 the lot

The whole job runs from ~ 22h at night to ~ 9h in the morning, even 
though I'm ignoring the US.


A lot of time is spent just reading from, and writing to, disk and 
parsing XML. Running the whole thing with .gz files doesn't make a big 
difference - saves some disk i/o, adds some CPU time, doesn't change XML 
parsing overhead.


I wanted to test-drive the binary format as a replacement for raw .osm 
files in this setup, hoping that it would give me the i/o benefits of 
gzip compressed data but also slash XML parsing time. The numbers that 
have been posted seemed promising. I might even be able to skip the 
bzip2 step at the end if the binary format should become widely used, 
just placing binary files on the server; and use the saved time to 
re-introduce US extracts.


So here's one user who's definitely in for it - the reason I asked right 
now was that I was planning to have a go at it in the near future, and 
wanted to make sure that I'm not using an old version or going down a 
path that everyone else already discarded. - If there's proper 
integration with Osmosis around the corner then I'd wait for that.


The way I understood it, Scott was re-using some code he placed inside 
the Osmosis tree from within his splitter code. Also I could imagine 
that using this fancy Google library means you'll have some format 
description files which might be shared across all projects using that 
library, perhaps even including the C++ reader that jamesmikedupont has 
built, but I'm not sure.


I prefer SVN over git for the simple reason that I only have to svn up 
and everything is there, but I'm sure it is going to be a matter of 
minutes before someone from Iceland points out that the same convenience 
can be had with git if one knows what they're doing ;)


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread Stefan de Konink

On Sun, 1 Aug 2010, Frederik Ramm wrote:

A lot of time is spent just reading from, and writing to, disk and parsing 
XML. Running the whole thing with .gz files doesn't make a big difference - 
saves some disk i/o, adds some CPU time, doesn't change XML parsing overhead.


I'm sorry, but the parsing overhead of Java or libXML is a known 
slowness factor. MSXML, pre/post-plane parsing, or even custom readers are 
not slow, and are limited only by the disk.


So the binary format, per se, is only faster because:
 - smaller filesize = less I/O
 - encoding: no XML rewriting

Anything else is currently available using, for example, osmsucker.c, 
which obviously does not use an XML parser because all its input is structured.



If the binary format can pack our doubles (lat/lon) and integers 
(version/ids) and makes strings available in UTF-8, that skips CPU and I/O 
overhead. But it makes the data not human-readable. I can totally live with 
that, and I hope the API protocol also gets protocol buffers.



Stefan

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread Frederik Ramm

Hi,

Stefan de Konink wrote:
I'm sorry, but the parsing overhead of Java or libXML is a known 
slowness factor. 


You don't have to be sorry, you're talking to the person who has patched 
osm2pgsql to parse XML with strcmp:


http://trac.openstreetmap.org/browser/applications/utils/export/osm2pgsql/primitive_xml_parsing

Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread Anthony
On Sun, Aug 1, 2010 at 6:00 PM, Stefan de Konink ste...@konink.de wrote:
 If the binary format can pack our doubles (lat/lon)

lat/lon is stored as a double?  I always use an int (and
divide/multiply by 1000).

http://wiki.openstreetmap.org/wiki/Database_schema

Yeah, OSM seems to be doing the same thing.

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread Frederik Ramm

Hi,

Anthony wrote:

On Sun, Aug 1, 2010 at 6:00 PM, Stefan de Konink ste...@konink.de wrote:

If the binary format can pack our doubles (lat/lon)


lat/lon is stored as a double?  I always use an int (and
divide/multiply by 1000).


The binary format seems to encode them as 64-bit integers, but protocol 
buffers make sure that bits are not wasted if unused. Also, in his 
original post about the binary format, Scott explained:



If there is a batch of consecutive nodes to be
output that have no tags at all, I use a special dense format. I omit
the tags and store the group 'columnwise', as an array of ID's, array
of latitudes, and array of longitudes, and delta-encode each
column. This reduces header overheads and allows delta-coding to work
very effectively. With the default ~1cm granularity, nodes within
about 6 km of each other can be represented by as few as 7 bytes
each plus the costs of the metadata if it is included.


I hope that answers that.
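
To make the columnwise delta coding concrete, here is a minimal sketch of
encoding and decoding one such column over plain arrays; the real code in
Scott's crosby.binary packages differs:

    class DeltaCodec {
        // Delta-encode a column of values (IDs, latitudes, or longitudes):
        // each entry stores only the difference from its predecessor, so
        // nearby nodes yield small numbers that protobuf varints keep short.
        static long[] encode(long[] column) {
            long[] out = new long[column.length];
            long last = 0;
            for (int i = 0; i < column.length; i++) {
                out[i] = column[i] - last;
                last = column[i];
            }
            return out;
        }

        static long[] decode(long[] deltas) {
            long[] out = new long[deltas.length];
            long last = 0;
            for (int i = 0; i < deltas.length; i++) {
                last += deltas[i];
                out[i] = last;
            }
            return out;
        }
    }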

Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread Stefan de Konink

On Sun, 1 Aug 2010, Anthony wrote:


On Sun, Aug 1, 2010 at 6:00 PM, Stefan de Konink ste...@konink.de wrote:

If the binary format can pack our doubles (lat/lon)


lat/lon is stored as a double?  I always use an int (and
divide/multiply by 1000).

http://wiki.openstreetmap.org/wiki/Database_schema

Yeah, OSM seems to be doing the same thing.


OSM uses too many digits for 16-bit numerics anyway ;) But I do hope that 
you don't expect your geometry engine to store this in an int ;)



Stefan

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-08-01 Thread Ævar Arnfjörð Bjarmason
On Sun, Aug 1, 2010 at 21:24, Frederik Ramm frede...@remote.org wrote:

 I prefer SVN over git for the simple reason that I only have to svn up and
 everything is there but I'm sure it is going to be a matter of minutes
 before someone from Iceland points out that the same convenience can be had
 with git if one knows what they're doing ;)

Why would you think that? I live in Germany now :)

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-07-31 Thread Frederik Ramm

Scott, others,

Scott Crosby wrote:

I would like to announce code implementing a binary OSM format that
supports the full semantics of the OSM XML.


[...]


The changes to osmosis are just some new tasks to handle reading and
writing the binary format. 


[...]

This was 3 months ago.

What's the status of this project? Are people actively using it? Is it 
still being developed? Can the Osmosis tasks be used in the new Osmosis 
code architecture (see over on osmosis-dev) that Brett has introduced 
with 0.36?


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-07-31 Thread Brett Henderson
On Sun, Aug 1, 2010 at 2:26 AM, Frederik Ramm frede...@remote.org wrote:

 Scott, others,


 Scott Crosby wrote:

 I would like to announce code implementing a binary OSM format that
 supports the full semantics of the OSM XML.


 [...]


  The changes to osmosis are just some new tasks to handle reading and
 writing the binary format.


 [...]

 This was 3 months ago.

 What's the status of this project? Are people actively using it? Is it
 still being developed? Can the Osmosis tasks be used in the new Osmosis code
 architecture (see over on osmosis-dev) that Brett has introduced with 0.36?


I'm curious about this as well.  The main reason for me introducing the new
project structure was to facilitate the integration of new features like
this.  They're relatively easy to add (some Ant and Ivy foo required ...),
and can be removed later on if they're not maintained or people lose
interest in them.  If there's a demand for this binary format I'm happy to
help integrate it as a new project into the existing codebase.

I believe the existing version of this binary OSM format is implemented as a
fork in a Git repo, so I suspect it will take some effort to update it to run
against 0.36.  The code hasn't changed a lot, but the build processes have.

Brett
___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-06-16 Thread Nolan Darilek
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hey, just wondering what the status of this project is?

My app, which could greatly benefit from this format, is advancing to a
usable state, but the main bottleneck for me right now is database size.
My current format stores Texas in around 10 GB, which doesn't easily
scale. If I can fit the entire globe into just over half that, well,
that'd rock. :)

Thanks.


On 05/02/2010 04:25 AM, jamesmikedup...@googlemail.com wrote:
 OK, my reader is now working; it can read to the end of the file,
 and now I am fleshing out the template dump functions to emit the data.
 g...@github.com:h4ck3rm1k3/OSM-Osmosis.git
 
 My new idea is that we could use a binary version of the rtree, I have
 already ported the rtree to my older template classes.
 
 We could use the rtree to sort the data and emit the blocks based on
 that. The rtree data structures themselves could be stored in the
 protobuffer so that they are persistent and also readable by all.
 
 
 https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg
 
 
 notes:
 http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042
 
 doxygen here :
 http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html
 
 On Sun, May 2, 2010 at 7:35 AM, Scott Crosby scrosb...@gmail.com wrote:
 -- Forwarded message --
 (Accidentally did not reply to list)


 Some of these questions may be a bit premature, but I don't know how far
 along your design is, and perhaps asking them now may influence that
 design in ways that work for me.


 I'm willing to call what I've designed so far in the file format mostly
 complete, except for some of the header design issues I've brought up
 already. The question is what extensions make sense to define now, such as
 bounding boxes, and choosing the right definition for them.



 Unfortunately, this method introduces a variety of complications. First,
 the database for TX alone is 10 gigs. Ballpark estimations are that I
 might need half a TB or more to store the entire planet. I'll also need
 substantial RAM to store the working set for the DB index. All this
 means that, to launch this project on a global scale, I'd need a lot
 more funding than I as an individual am likely to find.

 With pruning out metadata, some judicious filtering of uninteresting tags,
 and increasing the granularity to 10 microdegrees (about 1m resolution),
 I've fit the whole planet in 3.7gb.


 Is there a performance or size penalty to ordering the data
 geographically rather than by ID?

 I expect no performance penalty.

 As for a size penalty, it will be a mixed bag. Ordering geographically
 should reduce the similarity for node ID numbers, increasing the space
 required to store them. It should increase the similarity for latitude and
 longitude numbers, which would reduce the size. It might change the re-use
 frequency of strings. On the whole, I suspect the filesize would remain
 within 10% of what it is now and believe it will decrease, but I have no way
 to know.


 I understand that this won't be the
 default case, but I'm wondering if there would likely be any major
 performance issues for using it in situations where you're likely to
 want bounding-box access rather than simply pulling out entities by ID.


 I have no code for pulling entities out by ID, but that would be
 straightforward to add, if there was a demand for it.

 There should be no problems at all for doing geographic queries. My vision
 for a bounding box access is that the file lets you skip 'most' blocks that
 are irrelevant to a query. 'most' depends a lot on the data and how exactly
 the dataset is sorted for geographic locality.

 But there may be problems in geographic queries. Things like
 cross-continental airways if they are in the OSM planet file would cause
 huge problems; their bounding box would cover the whole continent,
 intersecting virtually any geographic lookup. Those geographic lookups would
 then need to find the nodes in those long ways which would require loading
 virtually every block containing nodes.  I have considered solutions for
 this issue, but I do not know if problematic ways like this exist. Does OSM
 have ways like this?


 Also, is there any reason that this format wouldn't be suitable for a
 site with many active users performing geographic, read-only queries of
 the data?

 A lot of that depends on the query locality. Each block has to be
 independently decompressed and parsed before the contents can be examined;
 that takes around 1 ms. At a small penalty in filesize, you can use 4k
 entities in a block which decompress and parse faster. If the client is
 interested in many ways in a particular geographic locality, as yours seems
 to, then this is perfect. Grab the blocks and cache the decompressed data in
 RAM where it can be re-used for subsequent geographic queries in the same
 locality.
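
A sketch of the caching strategy Scott describes here, as a bounded LRU map
over decompressed blocks keyed by file offset; the capacity and types are
illustrative, not from his code:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Caches decompressed, parsed blocks so repeated geographic queries
    // in the same locality skip the ~1 ms decompress-and-parse cost.
    class BlockCache<B> extends LinkedHashMap<Long, B> {
        private final int capacity;

        BlockCache(int capacity) {
            super(16, 0.75f, true); // access-order iteration gives LRU behavior
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<Long, B> eldest) {
            return size() > capacity; // evict the least recently used block
        }
    }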


 Again, I'd guess not, since the data isn't compressed as such,
 but maybe seeking several gigs into a 

Re: [OSM-dev] New OSM binary fileformat implementation.

2010-05-06 Thread Nolan Darilek
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Sorry for the delay in responding; crazy life, and I've been fixing
existing bugs in my project rather than thinking about breaking new ground.

On 05/02/2010 12:35 AM, Scott Crosby wrote:
 
 With pruning out metadata, some judicious filtering of uninteresting tags,
 and increasing the granularity to 10 microdegrees (about 1m resolution),
 I've fit the whole planet in 3.7gb.
 
Sweet. I hope this format works for my use case.



 I have no code for pulling entities out by ID, but that would be
 straightforward to add, if there was a demand for it.
 
I would definitely need that. I'm coding to the travelingsalesman API's
DataSet interface which does include retrieval by ID.

 have to pay a disk seek whether it is in my format or not. My format,
 being very dense, might let RAM hold the working set and avoid the disk
 seek. 1 ms to decompress is already far faster than a hard drive, though
 not an SSD.

Keeping everything in RAM is probably workable. At the very least, to go
global with a format like this would seem to be a matter of starting
with a mid-level VPS that stores everything on disk and eventually
upgrading to a high-RAM, low disk space EC2 or GoGrid instance. Without
it, I'm looking at half a TB of storage and possibly a significant chunk
of RAM, and even so I don't think my current dataset can handle that.

In other words, I like the option of keeping everything in RAM far
better than what I'm doing right now. :)

 
 Could you tell me more about the kinds of lookups your application will do?
 

Sure. You can see the interface I've implemented here:

http://travelingsales.svn.sourceforge.net/viewvc/travelingsales/trunk/libosm/src/org/openstreetmap/osm/data/IDataSet.java?view=markup

Basically, the executive summary is that there are four broad kinds of
lookups:

Entity by ID, as mentioned earlier

Entities based on intersection with bounding box, currently done by the
somewhat inaccurate method of finding all contained nodes, then
returning any associated ways/relations. Would be great if I could
locate contained ways even if they don't have a node in the box, but
even if not, it'd be no worse than what's there now. :)

Entities by presence of certain tags, in some instances also with
bounding box conditions (i.e., all amenity=fuel nodes, or all such
nodes within a given bounds)

Nearest entity to a given point, expanding outward. I can, for instance,
roughly find the nearest way by finding the node nearest to a set of
coordinates, checking for its presence in any ways, then finding the
next nearest and recursing outward until the conditions are met. (The
conditions check is done externally, so the search need only return the
nearest entity, next nearest, etc.)
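
Condensed into an interface sketch, those four lookup families look roughly
as follows; the names are illustrative only, and the real contract is the
IDataSet interface linked above:

    import java.util.Iterator;

    // An illustrative condensation of the four lookup families described
    // above; entity and bounds types are left generic on purpose.
    interface OsmLookup<N, W, B> {
        N getNodeById(long id);                   // entity by ID
        Iterator<W> getWaysInBounds(B bbox);      // bounding-box intersection
        Iterator<N> getNodesByTag(String key, String value, B boundsOrNull);
        N getNearestNode(double lat, double lon); // expand-outward search
    }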

I know you've said elsewhere that you don't want this format to replace
the need for a database, and I respect that. I just don't quite know
where that line is. Even so, I clearly don't need all of my database's
functionality for the OSM-facing aspects of this app and hope that these
limited uses are in scope.

Thanks for thinking about and working on these issues. :)

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvi6r4ACgkQIaMjFWMehWJvigCfV6d+2UY/5Mm1HCHquTMOG5Ru
h50An0DeN8y+ADCBsVLw1V4w0xt+nql1
=wJIc
-END PGP SIGNATURE-


___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-05-05 Thread jamesmikedup...@googlemail.com
I have now reworked the reader code to use only the internal Google
protobuffer features for reading. It can now read the entire file to the
end. The code is committed.
Also, I have checked in an example small protobuffer file for testing.
http://github.com/h4ck3rm1k3/OSM-Osmosis

any testers would be appreciated.

in the dir OSM-Osmosis/src/crosby/binary/cpp, build and then run:
./osmprotoread albania.osm.protobuf3 > out.txt

mike

On Sun, May 2, 2010 at 11:25 AM, jamesmikedup...@googlemail.com wrote:

 OK, my reader is now working; it can read to the end of the file,
 and now I am fleshing out the template dump functions to emit the data.
 g...@github.com:h4ck3rm1k3/OSM-Osmosis.git

 My new idea is that we could use a binary version of the rtree, I have
 already ported the rtree to my older template classes.

 We could use the rtree to sort the data and emit the blocks based on
 that. The rtree data structures themselves could be stored in the
 protobuffer so that they are persistent and also readable by all.


 https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg


 notes:
 http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042

 doxygen here :
 http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html

 On Sun, May 2, 2010 at 7:35 AM, Scott Crosby scrosb...@gmail.com wrote:
  -- Forwarded message --
  (Accidentally did not reply to list)
 
 
  Some of these questions may be a bit premature, but I don't know how far
  along your design is, and perhaps asking them now may influence that
  design in ways that work for me.
 
 
  I'm willing to call what I've designed so far in the file format mostly
  complete, except for some of the header design issues I've brought up
  already. The question is what extensions make sense to define now, such
 as
  bounding boxes, and choosing the right definition for them.
 
 
 
  Unfortunately, this method introduces a variety of complications. First,
  the database for TX alone is 10 gigs. Ballpark estimations are that I
  might need half a TB or more to store the entire planet. I'll also need
  substantial RAM to store the working set for the DB index. All this
  means that, to launch this project on a global scale, I'd need a lot
  more funding than I as an individual am likely to find.
 
  With pruning out metadata, some judicious filtering of uninteresting
 tags,
  and increasing the granularity to 10 microdegrees (about 1m resolution),
  I've fit the whole planet in 3.7gb.
 
 
  Is there a performance or size penalty to ordering the data
  geographically rather than by ID?
 
  I expect no performance penalty.
 
  As for a size penalty, it will be a mixed bag. Ordering geographically
  should reduce the similarity for node ID numbers, increasing the space
  required to store them. It should increase the similarity for latitude
 and
  longitude numbers, which would reduce the size. It might change the
 re-use
  frequency of strings. On the whole, I suspect the filesize would remain
  within 10% of what it is now and believe it will decrease, but I have no
 way
  to know.
 
 
  I understand that this won't be the
  default case, but I'm wondering if there would likely be any major
  performance issues for using it in situations where you're likely to
  want bounding-box access rather than simply pulling out entities by ID.
 
 
  I have no code for pulling entities out by ID, but that would be
  straightforward to add, if there was a demand for it.
 
  There should be no problems at all for doing geographic queries. My
 vision
  for a bounding box access is that the file lets you skip 'most' blocks
 that
  are irrelevant to a query. 'most' depends a lot on the data and how
 exactly
  the dataset is sorted for geographic locality.
 
  But there may be problems in geographic queries. Things like
  cross-continental airways if they are in the OSM planet file would cause
  huge problems; their bounding box would cover the whole continent,
  intersecting virtually any geographic lookup. Those geographic lookups
 would
  then need to find the nodes in those long ways which would require
 loading
  virtually every block containing nodes.  I have considered solutions for
  this issue, but I do not know if problematic ways like this exist. Does
 OSM
  have ways like this?
 
 
  Also, is there any reason that this format wouldn't be suitable for a
  site with many active users performing geographic, read-only queries of
  the data?
 
  A lot of that depends on the query locality. Each block has to be
  independently decompressed and parsed before the contents can be examined;
  that takes around 1ms. At a small penalty in filesize, you can use 4k
  entities in a block which decompress and parse faster. If the client is
  interested in many ways in a particular geographic locality, as yours
 seems
  to, then this is perfect. Grab the blocks and cache the 

Re: [OSM-dev] New OSM binary fileformat implementation.

2010-05-02 Thread jamesmikedup...@googlemail.com
OK, my reader is now working; it can read to the end of the file,
and now I am fleshing out the template dump functions to emit the data.
g...@github.com:h4ck3rm1k3/OSM-Osmosis.git

My new idea is that we could use a binary version of the rtree, I have
already ported the rtree to my older template classes.

We could use the rtree to sort the data and emit the blocks based on
that. The rtree data structures themselves could be stored in the
protobuffer so that they are persistent and also readable by all.


https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg


notes:
http://www.openstreetmap.org/user/h4ck3rm1k3/diary/9042

doxygen here :
http://xhema.flossk.org:8080/mapdata/03/EPANatReg/html/classRTreeWorld.html

On Sun, May 2, 2010 at 7:35 AM, Scott Crosby scrosb...@gmail.com wrote:
 -- Forwarded message --
 (Accidentally did not reply to list)


 Some of these questions may be a bit premature, but I don't know how far
 along your design is, and perhaps asking them now may influence that
 design in ways that work for me.


 I'm willing to call what I've designed so far in the file format mostly
 complete, except for some of the header design issues I've brought up
 already. The question is what extensions make sense to define now, such as
 bounding boxes, and choosing the right definition for them.



 Unfortunately, this method introduces a variety of complications. First,
 the database for TX alone is 10 gigs. Ballpark estimations are that I
 might need half a TB or more to store the entire planet. I'll also need
 substantial RAM to store the working set for the DB index. All this
 means that, to launch this project on a global scale, I'd need a lot
 more funding than I as an individual am likely to find.

 With pruning out metadata, some judicious filtering of uninteresting tags,
 and increasing the granularity to 10 microdegrees (about 1m resolution),
 I've fit the whole planet in 3.7gb.


 Is there a performance or size penalty to ordering the data
 geographically rather than by ID?

 I expect no performance penalty.

 As for a size penalty, it will be a mixed bag. Ordering geographically
 should reduce the similarity for node ID numbers, increasing the space
 required to store them. It should increase the similarity for latitude and
 longitude numbers, which would reduce the size. It might change the re-use
 frequency of strings. On the whole, I suspect the filesize would remain
 within 10% of what it is now and believe it will decrease, but I have no way
 to know.


 I understand that this won't be the
 default case, but I'm wondering if there would likely be any major
 performance issues for using it in situations where you're likely to
 want bounding-box access rather than simply pulling out entities by ID.


 I have no code for pulling entities out by ID, but that would be
 straightforward to add, if there was a demand for it.

 There should be no problems at all for doing geographic queries. My vision
 for a bounding box access is that the file lets you skip 'most' blocks that
 are irrelevant to a query. 'most' depends a lot on the data and how exactly
 the dataset is sorted for geographic locality.

 But there may be problems in geographic queries. Things like
 cross-continental airways if they are in the OSM planet file would cause
 huge problems; their bounding box would cover the whole continent,
 intersecting virtually any geographic lookup. Those geographic lookups would
 then need to find the nodes in those long ways which would require loading
 virtually every block containing nodes.  I have considered solutions for
 this issue, but I do not know if problematic ways like this exist. Does OSM
 have ways like this?


 Also, is there any reason that this format wouldn't be suitable for a
 site with many active users performing geographic, read-only queries of
 the data?

 A lot of that depends on the query locality. Each block has to be
 independently decompressed and parsed before the contents can be examined;
 that takes around 1ms. At a small penalty in filesize, you can use 4k
 entities in a block which decompress and parse faster. If the client is
 interested in many ways in a particular geographic locality, as yours seems
 to, then this is perfect. Grab the blocks and cache the decompressed data in
 RAM where it can be re-used for subsequent geographic queries in the same
 locality.


 Again, I'd guess not, since the data isn't compressed as such,
 but maybe seeking several gigs into a file to locate nearby entities
 would be a factor, or it may work just fine for single-user access but
 not so well with multiple distinct seeks for different users in widely
 separate locations.


 Ultimately, it depends on your application, which has a particular locality
 in its lookups. Application locality, combined with a fileformat, defines
 the working set size. If RAM is insufficient to hold the working set, you'll
 have to pay a disk seek whether it is in my format or not. My 

Re: [OSM-dev] New OSM binary fileformat implementation.

2010-05-01 Thread Scott Crosby

 Initial thoughts from my end (as main developer of Osmosis) is that it
 would be best to keep it separate until you've let it evolve somewhat.  If
 you can develop it as a plugin for now it will let us (ie. Osmosis core, and
 osmosis bin format) remain independent until it matures and stabilises.  If
 it has a wide enough audience (ie. useful to more than a small handful of
 people) then we can look at incorporating it into the core of Osmosis.  At
 that point it will need to meet the current Osmosis code quality checks (eg.
 checkstyle), have unit tests, and pass a code review.  None of that is too
 difficult, but I need to ensure it doesn't add a maintenance burden.


I agree wholeheartedly with letting it evolve, and I'm interested in hearing
your (and others') thoughts on what additional features to include or
exclude.


 I can possibly help with some ant and ivy advice in terms of how to
 incorporate it with Osmosis.  It should be possible to make the existing
 osmosis library an Ivy dependency of your library.  It might make sense to
  split your library into two parts: the generic re-usable code and the Osmosis
 specific tasks.  But I'm only speculating until I understand what you've
 done.


Thanks, I'd appreciate that advice. The code is already segregated into
osmosis-independent and osmosis-dependent packages; the independent code is
already reused in my patches to the mkgmap splitter to read the binary
files.


 Longer term I'm open to suggestions on how Osmosis should incorporate new
 features like this.  It may actually make more sense to pull some existing
 features out of Osmosis and make them plugins also, or it may be simpler to
 just keep adding to the core.  But these questions are more suited to the
 osmosis-dev list.


You've got a nice architecture; it was very easy to figure out how to get my
code to tie in to your design, at least as a built-in. Making it work as a
plugin will be more effort.

My biggest wish is that there should be a common library used by the
different OSM software for representing low-level concepts such as points,
bounding boxes, entities, etc., or at least a baseline interface that the
different OSM software offers for accessing their different entity
implementations. I could target that. mkgmap, osmosis, the splitter, and
josm all have 'Node' and 'Way' defined in semantically identical fashions,
yet they are incompatible types. Supporting all four of them requires
reimplementing the binary serializer and deserializer four times,
quadrupling the chances of having a bug. Such an interface would make it
possible to have a shared XML-entity parser/writer implementation and make
code more reusable across these applications. For instance, I'm interested
in reusing josm's quadtree implementation in osmosis. I have a local patch
that refactors it to support anything that has getBBox() and isUsable()
functions, but using it requires importing josm's notion of bounding boxes
and coordinates into osmosis along with refactoring out dependencies on
java.awt.geom.Rectangle2D and org.openstreetmap.josm.Main (via LatLon).
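
The refactor described here amounts to programming the quadtree against a
minimal capability interface, roughly like this sketch (not the actual
patch):

    // A minimal capability interface for spatially indexable objects: the
    // quadtree depends only on getBBox() and isUsable(), not on any one
    // project's entity types. The bounds type B stays generic.
    interface Indexable<B> {
        B getBBox();
        boolean isUsable();
    }

    // Signature sketch of a quadtree generic over anything Indexable.
    class QuadTree<B, T extends Indexable<B>> {
        void insert(T item) { /* place item by its bounding box */ }
    }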

Scott
___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-05-01 Thread Nolan Darilek
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 05/01/2010 09:47 AM, Scott Crosby wrote:

 I agree wholeheartedly with letting it evolve, and I'm interested in hearing
 your (and others') thoughts on what additional features to include or
 exclude.
 

Some of these questions may be a bit premature, but I don't know how far
along your design is, and perhaps asking them now may influence that
design in ways that work for me.

I'm developing an accessible map-browsing, GPS navigation app. You can
read my initial blog post on the project here:

http://thewordnerd.info/2010/03/introducing-hermes/

At the moment, this uses LibOSM from travelingsalesman and an
as-yet-unreleased dataset using MongoDB for the geospatial queries. I
don't really understand enough higher-level math to roll my own
geospatial code, especially since I can't visually verify the results,
so it's easier to use LibOSM and roll a dataset that I can run on a
production site than it would be to re-invent the wheel.

Unfortunately, this method introduces a variety of complications. First,
the database for TX alone is 10 gigs. Ballpark estimations are that I
might need half a TB or more to store the entire planet. I'll also need
substantial RAM to store the working set for the DB index. All this
means that, to launch this project on a global scale, I'd need a lot
more funding than I as an individual am likely to find.

I'm really excited to read your numbers for compression, because at
first glance, this would seem to take the project from something that
I'd need substantial EC2 infrastructure for, to something I can run on a
mid-level VPS, slashing costs from $1000+/month to $50 or so/month. So my
questions:

Is there a performance or size penalty to ordering the data
geographically rather than by ID? I understand that this won't be the
default case, but I'm wondering if there would likely be any major
performance issues for using it in situations where you're likely to
want bounding-box access rather than simply pulling out entities by ID.

Also, is there any reason that this format wouldn't be suitable for a
site with many active users performing geographic, read-only queries of
the data? Again, I'd guess not, since the data isn't compressed as such,
but maybe seeking several gigs into a file to locate nearby entities
would be a factor, or it may work just fine for single-user access but
not so well with multiple distinct seeks for different users in widely
separate locations.

Anyhow, I realize these questions may be naive at such an early stage,
but the idea that I may be able to pull this off without infrastructure
beyond my budget is an appealing one. Are there any reasons your binary
format wouldn't be able to accommodate this situation, or couldn't be
optimized to do so?

Thanks.
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkvcR80ACgkQIaMjFWMehWLAngCcDTYdjW6SrKaPoKdqjjEY4r3U
C34AnR4f8NEM18Z07Xr9vjli8/6UFYCz
=feGc
-END PGP SIGNATURE-


___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-29 Thread Marcus Wolschon
Sounds cool!

What indices exist?
Can I access elements by ID?
What about access by bounding-box?
Are there navigable back-references from Members to relations
and from Nodes to Ways?


Marcus

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-29 Thread Scott Crosby
On Thu, Apr 29, 2010 at 2:15 AM, Frederik Ramm frede...@remote.org wrote:

 Scott,


 Scott Crosby wrote:

 I would like to announce code implementing a binary OSM format that
 supports the full semantics of the OSM XML.


 This all sounds very interesting, and you seem to have spent a lot of
 thought on it and documented it well.

 If I understand it correctly, this is meant to be a replacement for the XML
 files as a transport format for OSM data. It is not meant to offer random
 access in any way, and thus differs from other attempts at creating binary
 formats that could be used in lieu of databases, having indexes and all.


Roughly, yes, this is intended as a transport format, but the design is
flexible.

If the file is physically ordered so that blocks have strong geographic
locality and block metadata includes bounding boxes, then those bounding
boxes can be used to skip unneeded blocks. If the file is physically ordered
in type/id order (as the current planet is), and block metadata includes the
minimum and maximum id for each block, then, as before, the metadata can be
used to examine only the desired blocks. If both searches are critical, then
generate two files, one with geographic locality and one sorted by type/id.
Storing 'planet-omitmeta.bin' *AND* 'planet.bin' is still cheaper than
storing 'planet-100303.osm.gz'.
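
A sketch of the block-skipping idea follows; bounding-box fields in block
metadata are a proposed extension in this thread, not part of the format
yet, so the class below is hypothetical:

    // Skip whole blocks whose advertised bounding box cannot intersect the
    // query box; only surviving blocks are decompressed and parsed.
    // (Ignores antimeridian wraparound for simplicity.)
    class BBox {
        final double minLat, minLon, maxLat, maxLon;

        BBox(double minLat, double minLon, double maxLat, double maxLon) {
            this.minLat = minLat; this.minLon = minLon;
            this.maxLat = maxLat; this.maxLon = maxLon;
        }

        boolean intersects(BBox o) {
            return minLat <= o.maxLat && o.minLat <= maxLat
                && minLon <= o.maxLon && o.minLon <= maxLon;
        }
    }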

As you point out, a binary format means different things to different
people. I chose this design because it would be useful as-is and could
offer future features without requiring changes to the file format.
Geographic searches and searches by type/id merely wait on implementing
code for physically reordering the file to be written. Adding the
appropriate fields to the metadata header is trivial in comparison. In
addition, nothing in the design precludes adding fileblocks that contain
data other than OSM entities. Fileblocks can contain an index from a node
ID to the ways and relations that contain it. Metadata headers on these
blocks can indicate which block contains the index entries for a
particular node.
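
For instance, such an index could be built as a simple multimap from node ID
to containing way IDs before being serialized into its own fileblock; this
is a sketch of the idea only, with hypothetical types:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    class NodeWayIndex {
        // Build a node-ID -> way-IDs index of the kind such an extra
        // fileblock could carry; input maps each way ID to its node IDs.
        static Map<Long, List<Long>> build(Map<Long, long[]> wayNodes) {
            Map<Long, List<Long>> index = new HashMap<Long, List<Long>>();
            for (Map.Entry<Long, long[]> way : wayNodes.entrySet()) {
                for (long nodeId : way.getValue()) {
                    List<Long> ways = index.get(nodeId);
                    if (ways == null) {
                        ways = new ArrayList<Long>();
                        index.put(nodeId, ways);
                    }
                    ways.add(way.getKey());
                }
            }
            return index;
        }
    }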

My vision is that in many cases it is better to have a simple format that is
very dense and lets you skip the 95% of the data that you don't care about,
rather than design a very complex or significantly larger format (e.g., a
relational database) that lets you skip 99+% of the data that you don't
care about. The more advanced formats may return less data, but the simple
format still reads 20 times less data than reading everything.


Maybe we should be careful about naming these formats to make their purpose
 clearer. The generic OSM binary format seems to mean different things to
 different people. The file extension .bin is perhaps not the best choice.

 Have you considered/evaluated Fast Infoset and if so, what were the
 reasons against that?



No, I was not aware of that compressed XML design.


  It is 5x-10x faster at
 reading and writing and 30-50% smaller


 The size figure is obviously compared to bz2; is the 5x-10x faster also
 compared to bz2, and if so, compared to the native Java bz2 or the external
 C one?


For file sizes, I was comparing to bzip2. For performance, I compared against
the gzip'ed planet; I didn't have the patience to compress or decompress
that much XML with bzip2.


  an entire planet, including
 all metadata, can be read in about 12 minutes and written in about 50
 minutes on a 3 year old dual-core machine.


 How did you measure write performance decoupled from read performance?
 Surely your 3 year old dual-core machine did not have the 150 gigs of RAM
 needed to suck the entire planet into memory?


I benchmarked:

   osmosis --read-bin file=planet.bin --write-null
   osmosis --read-bin file=planet.bin --write-bin file=planet2.bin

And measured ~12 minutes of CPU time for the first and ~60 minutes of CPU
time for the second.

With a dual-core system, using '--b bufferCapacity=2' gives some
concurrency, and writing can be done in around 40 minutes.


 You have paid an impressive amount attention to details in order to achieve
 the good performance and compression rates that you do. I'm slightly
 concerned about the robustness of it all - in the past, we often had planet
 files that were broken one way or the other, and it was usually possible to
 remedy this with some standard grep, sed, or dd actions - if one of your
 files ever breaks then I guess it is likely to be complete garbage ;-)


Without knowing how those prior planets were broken, I can't say whether
analogous breakage of files in my format could be repaired.




  Probably the most important TODO is packaging and fixing the build system.
 I have almost no experience with ant and am unfamiliar with java
 packaging practices, so I'd like to request help/advice on ant and
 suggestions on how to package the common parsing/serializing code so that
 it can be re-used across different programs.


 I suggest to ask on osmosis-dev, and get your new code into the Osmosis
 trunk quickly so people can play with it.

Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-29 Thread Scott Crosby
On Thu, Apr 29, 2010 at 2:16 AM, jamesmikedup...@googlemail.com wrote:

 Hi,
 I am working on the c++ decoder. Can you please tell me roughly where
 the message objects are?


(CC'ing back to the list)


 I need to understand how to decode this file.



 Do you have any header information or is it all raw protobuf? You
 mentioned blocks being compressed; are you using the protobuf tools to
 do this? Can you give me a rough layout of how to parse the bin file?


Parsing a protocol buffer requires knowing its length beforehand, which adds
some complexity.

Currently a file consists of repetitions of the following:

  a 32-bit integer encoding the header length (I believe it is in Java's
default big-endian byte order)
  a serialized FileBlockHeader message (see fileformat.proto)
  a serialized Blob message (see fileformat.proto; its length is given in
the header message)

Code implementing this is in BlockInputStream, BlockOutputStream, and
FileBlock in package crosby.binary.file.
To understand how compression is used, look at the definition of the Blob
message and at FileBlock.java.

Inside the blob is a serialized osm HeaderBlock or osm PrimitiveBlock
message.
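
As a rough illustration, a reader loop for this layout might look like the
following Java sketch. The generated class names and the getDatasize()
accessor are assumptions based on fileformat.proto, not the exact API:

  import java.io.DataInputStream;
  import java.io.EOFException;
  import java.io.FileInputStream;
  import java.io.IOException;

  // Hedged sketch of the length-prefixed block layout described above.
  public class BlockReaderSketch {
      public static void main(String[] args) throws IOException {
          DataInputStream in = new DataInputStream(new FileInputStream(args[0]));
          while (true) {
              int headerLength;
              try {
                  headerLength = in.readInt(); // 32-bit big-endian prefix
              } catch (EOFException e) {
                  break; // clean end of file
              }
              byte[] headerBytes = new byte[headerLength];
              in.readFully(headerBytes);
              Fileformat.FileBlockHeader header =
                  Fileformat.FileBlockHeader.parseFrom(headerBytes);
              byte[] blobBytes = new byte[header.getDatasize()];
              in.readFully(blobBytes);
              Fileformat.Blob blob = Fileformat.Blob.parseFrom(blobBytes);
              // blob wraps a (possibly compressed) HeaderBlock or
              // PrimitiveBlock; see the Blob definition in fileformat.proto.
          }
          in.close();
      }
  }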

Note that I am currently unhappy with the file header scheme and may
make incompatible changes.
There's definitely some cruft in fileformat.proto that needs to be removed.

thanks,
 mike

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-29 Thread Brett Henderson
On Thu, Apr 29, 2010 at 10:15 PM, Scott Crosby scrosb...@gmail.com wrote:




  Probably the most important TODO is packaging and fixing the build
 system.
 I have almost no experience with ant and am unfamiliar with java
 packaging practices, so I'd like to request help/advice on ant and
 suggestions on
 how to package the common parsing/serializing code so that it can be
 re-used across different programs.


 I suggest to ask on osmosis-dev, and get your new code into the Osmosis
 trunk quickly so people can play with it.


 I think it would be prudent to get suggestions from the OSM community
 first. Once the code is in osmosis, our ability to make
 compatibility-breaking changes to the format will be reduced.



Hi Scott,

Thanks for all the great work.  It will be a little while (at least several
days) before I can take a look at what you've done.

Initial thoughts from my end (as main developer of Osmosis) are that it
would be best to keep it separate until you've let it evolve somewhat.  If
you can develop it as a plugin for now, it will let us (ie. Osmosis core, and osmosis
bin format) remain independent until it matures and stabilises.  If it has a
wide enough audience (ie. useful to more than a small handful of people)
then we can look at incorporating it into the core of Osmosis.  At that
point it will need to meet the current Osmosis code quality checks (eg.
checkstyle), have unit tests, and pass a code review.  None of that is too
difficult, but I need to ensure it doesn't add a maintenance burden.

I can possibly help with some ant and ivy advice in terms of how to
incorporate it with Osmosis.  It should be possible to make the existing
osmosis library an Ivy dependency of your library.  It might make sense to
split your library in two parts, the generic re-usable code, and the Osmosis
specific tasks.  But I'm only speculating until I understand what you've
done.

Longer term I'm open to suggestions on how Osmosis should incorporate new
features like this.  It may actually make more sense to pull some existing
features out of Osmosis and make them plugins also, or it may be simpler to
just keep adding to the core.  But these questions are more suited to the
osmosis-dev list.

Brett
___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-28 Thread jamesmikedup...@googlemail.com
On Wed, Apr 28, 2010 at 7:02 PM, Scott Crosby scrosb...@gmail.com wrote:
 Hello!

 I would like to announce code implementing a binary OSM format that
 supports the full semantics of the OSM XML. It is 5x-10x faster at
 reading and writing and 30-50% smaller; an entire planet, including
 all metadata, can be read in about 12 minutes and written in about 50
 minutes on a 3 year old dual-core machine. I have implemented an
 osmosis reader and writer and have enhancements to the map splitter to
 read the format. Code is pure Java and uses Google protocol buffers
 for the low-level serialization.
That's very interesting. I would like to see this working with C++ as
well. I will have to look at the code.
Thanks for sharing,
mike

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-28 Thread Scott Crosby
On Wed, Apr 28, 2010 at 1:16 PM, OJ W ojwli...@googlemail.com wrote:

 where's the .proto file?


Proto files should be in the osmosis git repository at:

src/crosby/binary/fileformat.proto
src/crosby/binary/osmformat.proto


 do you have data files in this format available to download?


No, I do not have any files at this time, as I am not ready to declare the
file format stable.

You can make your own test files with --write-bin.

Scott
___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-28 Thread Scott Crosby

 I have a quick question: does the format support inserting data into an
 existing file? Or is it just some binary serialization?


The format is a binary serialization design and does not support
random-access read and write semantics. For that a database is probably more
suitable.

However, some changes can be done to a file relatively cheaply. Data can be
trivially appended. Rewriting a file could be fairly cheap, as each fileblock
is independently decodable and contains only ~8,000 OSM entities. A fileblock
can be copied from an input to an output without decompressing or parsing
it. Metadata in the block header could be used to find out which fileblocks
can be copied unchanged, or used to filter out unwanted blocks.
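
A hedged sketch of that copy-through path (same assumed class names as the
reader sketch earlier; wantBlock() is a hypothetical stand-in for whatever
metadata test the header supports):

  import java.io.DataInputStream;
  import java.io.DataOutputStream;
  import java.io.EOFException;
  import java.io.IOException;

  class BlockFilterSketch {
      // Copy or drop whole fileblocks; the blob bytes are treated as
      // opaque and are never decompressed or parsed.
      static void filterBlocks(DataInputStream in, DataOutputStream out)
              throws IOException {
          while (true) {
              int headerLength;
              try {
                  headerLength = in.readInt();
              } catch (EOFException e) {
                  break;
              }
              byte[] headerBytes = new byte[headerLength];
              in.readFully(headerBytes);
              Fileformat.FileBlockHeader header =
                  Fileformat.FileBlockHeader.parseFrom(headerBytes);
              byte[] blobBytes = new byte[header.getDatasize()];
              in.readFully(blobBytes);
              if (wantBlock(header)) {        // hypothetical metadata test
                  out.writeInt(headerLength); // pass the block through verbatim
                  out.write(headerBytes);
                  out.write(blobBytes);
              }
          }
      }

      static boolean wantBlock(Fileformat.FileBlockHeader header) {
          return true; // placeholder: a real filter would inspect metadata
      }
  }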


 Now we just need the dump tool for the database to create some planet dump
 file in your format.


If osmosis is used as the dump tool, I believe a --write-bin should suffice
to make a planet dump. The code just ties into the existing Source/Sink
osmosis architecture.
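
For instance, a conversion pipeline might look like this (a sketch:
--read-xml is the stock osmosis task; --write-bin is the new task from this
code):

   osmosis --read-xml file=planet.osm --write-bin file=planet.bin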

Scott
___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-28 Thread Tom Hughes
On 28/04/10 20:07, Scott Crosby wrote:

 Now we just need the dump tool for the database to create some
 planet dump file in your format.

 If osmosis is used as the dump tool, I believe a --write-bin should
 suffice to make a planet dump. The code just ties into the existing
 Source/Sink osmosis architecture.

Osmosis isn't the dump tool, no. The planetdump program is.

If we were going to offer a binary version for download then it would be 
better to generate it from the XML one anyway, rather than from the 
database, so that the two versions are consistent.

Tom

-- 
Tom Hughes (t...@compton.nu)
http://compton.nu/

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-28 Thread Stefan de Konink
On 28-04-10 21:50, Tom Hughes wrote:
 Osmosis isn't the dump tool, no. The planetdump program is.
 
 If we were going to offer a binary version for download the it would be 
 better to generate it from the xml one anyway, rather than from the 
 database, so that the two versions are consistent.

Since the current XML version is inconsistent, a direct database dump
will be more consistent than any conversion.

Matt has already received a lot of examples of inconsistencies.


Stefan

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-28 Thread Grant Slater
On 28 April 2010 21:04, Stefan de Konink ste...@konink.de wrote:
 Since the current XML version is inconsistent, a direct database dump
 will be more consistent than any conversion.

 Matt has already received a lot of examples of inconsistencies.


Since: http://trac.openstreetmap.org/changeset/20396 ?

/ Grant

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-28 Thread Stefan de Konink
On 28-04-10 22:11, Grant Slater wrote:
 On 28 April 2010 21:04, Stefan de Konink ste...@konink.de wrote:
 Since the current XML version is inconsistent, a direct database dump
 will be more consistent than any conversion.

 Matt received a lot of examples for inconsistencies already.

 
 Since: http://trac.openstreetmap.org/changeset/20396 ?

Did that fix *all* the older inconsistencies? I mean did you run a query
to verify all referential constraints on the current table?


Stefan

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-28 Thread jamesmikedup...@googlemail.com
I got it built.

Build instructions:

First get protobuf and build it manually:

  svn checkout http://protobuf.googlecode.com/svn/trunk/ protobuf-read-only
  cd protobuf-read-only/
  bash ./autogen.sh
  ./configure
  make
  sudo make install

Go into the java dir and use ant to build the jar files.

Go to the sources of the OSM tool and install the protobuf jar into
lib/compile; normally this would be handled by ivy, but there are issues:

  cp /home/mdupont/experiments/osm/protobuf-read-only/java/target/protobuf-java-2.3.1-pre.jar lib/compile/

Then run the protocol compiler to generate the needed code:

  cd src/crosby/binary/
  protoc --java_out=../.. fileformat.proto
  protoc --java_out=../.. osmformat.proto

Now you can build.


On Wed, Apr 28, 2010 at 7:02 PM, Scott Crosby scrosb...@gmail.com wrote:
 Hello!

 I would like to announce code implementing a binary OSM format that
 supports the full semantics of the OSM XML. It is 5x-10x faster at
 reading and writing and 30-50% smaller; an entire planet, including
 all metadata, can be read in about 12 minutes and written in about 50
 minutes on a 3 year old dual-core machine. I have implemented an
 osmosis reader and writer and have enhancements to the map splitter to
 read the format. Code is pure Java and uses Google protocol buffers
 for the low-level serialization.

 Comparing the file sizes:

  8.2gb   planet-100303.osm.bz2
 12  gb   planet-100303.osm.gz
  5.2gb   planet-omitmeta.bin
  6.2gb   planet.bin

 The omitmeta version omits the uid/user/version/timestamp metadata
 fields on each entity and are faster to generate and read.

 The design is very extensible. The low-level file format is designed
 to support random access at the 'fileblock' granularity, where a
 fileblock can contain ~8k OSM entities. There is *no* tag hardcoding
 used; all keys and values are stored in full as opaque strings. For
 future scalability, 64 bit node/way/relation ID's are assumed. The
 current serializer preserves the order of OSM entities and tags on OSM
 entities. To flexibly handle multiple resolutions, the granularity, or
 resolution, used for representing locations and timestamps is
 adjustable in multiples of 1 millisecond and 1 nanodegree and can be
 set independently for each fileblock. The default scaling factors are
 1000 milliseconds and 100 nanodegrees, corresponding to about 1cm at
 the equator; these match the current resolution of the OSM database.
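
A rough sketch of the quantization described above (illustrative only, not
code from the announcement; the actual field and accessor names in
osmformat.proto may differ):

  // Positions are stored as integers in units of `granularity` nanodegrees.
  class GranularitySketch {
      static long encode(double degrees, int granularityNanodeg) {
          return Math.round(degrees * 1e9 / granularityNanodeg);
      }
      static double decode(long stored, int granularityNanodeg) {
          return stored * granularityNanodeg * 1e-9;
      }
      // Default granularity 100: one unit is ~1cm at the equator.
      // At granularity 10000 (10 microdegrees): ~1m.
  }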

 Smaller files can be generated. At 10 microdegrees granularity,
 corresponding to about 1m of resolution, the filesize decreases by
 about 1gb. Space may also be saved by removing uninteresting UUID
 tags or perhaps by having stronger geographic locality when building
 the file.

 I have also tested the binary format on some SRTM contour lines in OSM
 0.5 XML format, obtaining about a 50:1 compression ratio. This might
 be further improved by choosing a granularity equal to the isohypse
 grid size.

 // Testing

 I have tested this code on the Cloudmade extract of Rhode
 Island. After converting the entire file to and from binary format,
 the XML output is bytewise identical to original file except for the
 one line indicating the osmosis version number.

 When run through the splitter, the output is not bytewise identical to
 before because of round-off errors 16 digits after the decimal point;
 this could be fixed by having the splitter behave like osmosis and
 only output 7 significant digits.

 // To use:

 Demonstration code is available on github at

    http://github.com/scrosby/OSM-Osmosis   and
    http://github.com/scrosby/OSM-splitter

 See the 'master' branches.

 Please note that this is at present unpackaged demonstration code and
 the fileformat may change to incorporate suggestions. Also note that
 the shared code between the splitter and osmosis currently lives in
 the osmosis git repository. You'll also need to go into the
 crosby.binary directory and run the protocol compiler ('protoc') on
 the .proto files (See comments in those files for the command line.).


 /// The design ///

 I use Google protocol buffers for the low-level store. Given a
 specification file of one or more messages, the protocol buffer
 compiler writes low-level serialization code. Messages may contain
 other messages, forming hierarchical structures. Protocol buffers also
 support extensibility; new fields can be added to a message and old
 clients can read those messages without recompiling. For more details,
 please see http://code.google.com/p/protobuf/. Google officially
 supports C++, Java, and Python, but compilers exist for other
 languages.  An example message specification is:

 message Node {
    required sint64 id = 1;
    required sint64 lat = 7;
    required sint64 lon = 8;
    repeated uint32 keys = 9 [packed = true]; // Denote strings
    repeated uint32 vals = 10 [packed = true]; // Denote strings
    
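
A hedged sketch of how those keys/vals indices resolve against a per-block
string table (the accessor names follow protobuf's generated-code
conventions for the message above; the string-table representation itself
is an assumption):

  import java.util.List;

  class TagDecodeSketch {
      // Print each (key, val) tag pair of a node by looking up its packed
      // indices in the block's string table.
      static void printTags(Osmformat.Node node, List<String> stringTable) {
          for (int i = 0; i < node.getKeysCount(); i++) {
              String key = stringTable.get(node.getKeys(i));
              String val = stringTable.get(node.getVals(i));
              System.out.println(key + "=" + val);
          }
      }
  }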

Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-28 Thread jamesmikedup...@googlemail.com
I have created a branch with the cpp generated code:
g...@github.com:h4ck3rm1k3/OSM-Osmosis.git

The new C++ lib is called libosmprotobuf, what a great name.

In OSM-Osmosis/src/crosby/binary/, run make to generate the code (I
checked in the results).

In the subdir OSM-Osmosis/src/crosby/binary/cpp:

  bash ./autogen.sh
  ./configure
  make

It uses /usr/local/lib/libprotobuf.la, which is a bit of a hack.

My next step will be to hook this up to my existing c++ code:
https://code.launchpad.net/~jamesmikedupont/+junk/EPANatReg

I hope that I will be able to make nice small C++ tools that can then
process and/or produce these buffer files.

Very interesting stuff that Google has produced; thanks, Scott, for
making this public,
mike

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] New OSM binary fileformat implementation.

2010-04-28 Thread Grant Slater
On 28 April 2010 21:27, Stefan de Konink ste...@konink.de wrote:
 Since the current XML version is inconsistent, a direct database dump
 will be more consistent than any conversion.

 Matt has already received a lot of examples of inconsistencies.

 Since: http://trac.openstreetmap.org/changeset/20396 ?

 Did that fix *all* the older inconsistencies? I mean did you run a query
 to verify all referential constraints on the current table?


Well, if there are any inconsistencies, don't keep them secret.
Let's get them fixed. :-)

/ Grant

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev