Re: [OSM-dev] Timestamp in PBF files

2012-12-11 Thread Frederik Ramm

Hi,

On 12/05/2012 05:18 AM, Scott Crosby wrote:

I've merged the pull request and I think I've added the right
changelog, maven, and debian glue for the new version and pushed the
v1.3.0 tag to github.


Thank you.

As of today, the Geofabrik .osm.pbf downloads contain replication 
information in the PBF header.


The software most of you use to read PBF files is unlikely to make use 
of that information already, but will not be harmed by it.


I've built preliminary support for these new fields into Osmium and you 
can get my version here: 
https://github.com/woodpeck/osmium/tree/timestamp - if you build the 
osmium_debug tool in examples it will dump the new headers to stdout. 
If you plan to read/write replication information in your own programs I 
suggest that you wait until support is available in Jochen's Osmium 
version as the interface is likely to change slightly.


I expect that Marqqs' osmupdate utility either already supports these 
new fields or will do so in the very near future.


It would be great if someone were to add support to Osmosis which is 
likely to be a bit tricky as you have to shove replication information 
through the pipeline, but if all else fails I might have a go at it 
during the holidays.


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33

___
dev mailing list
dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/dev


Re: [OSM-dev] Timestamp in PBF files

2012-12-11 Thread Jochen Topf
On Tue, Dec 11, 2012 at 03:53:39PM +0100, Frederik Ramm wrote:
 stdout. If you plan to read/write replication information in your
 own programs I suggest that you wait until support is available in
 Jochen's Osmium version as the interface is likely to change
 slightly.

Actually, no, you should not wait for my Osmium version. Just go ahead
and do whatever you have to do.

I think the implementation and interface of all this hasn't been thought
through properly and until I or somebody else does, I am not planning to add
this to Osmium. For starters this makes XML and PBF files incompatible which is
not good. Next it has to be figured out what changes to the input data lead to
changes in the output of these flags. Obviously when you apply a diff those
headers should change, shouldn't they? Those things all have to be figured out
and implemented properly.

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298



Re: [OSM-dev] Timestamp in PBF files

2012-12-11 Thread Frederik Ramm

Hi,

On 11.12.2012 15:53, Frederik Ramm wrote:

I've built preliminary support for these new fields into Osmium and you
can get my version here:
https://github.com/woodpeck/osmium/tree/timestamp - if you build the
osmium_debug tool in examples it will dump the new headers to stdout.


I've also made my timestamp branch of Peter's osm-history-splitter 
(which I use to create these files) available on github, here:
https://github.com/woodpeck/osm-history-splitter/tree/timestamp - it 
takes commandline switches that allow you to set the headers to whatever 
you want. This requires the timestamp-supporting version of Osmium 
linked above.


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33



Re: [OSM-dev] Timestamp in PBF files

2012-12-11 Thread Brett Henderson
On 12 December 2012 01:53, Frederik Ramm frede...@remote.org wrote:

 Hi,


<snip>



 It would be great if someone were to add support to Osmosis which is
 likely to be a bit tricky as you have to shove replication information
 through the pipeline, but if all else fails I might have a go at it during
 the holidays.


I've done something similar with the streaming replication tasks  (ie.
--receive-replication, --replicate-apidb, --send-replication-data,
--write-replication).  They exchange state information from source to sink
via the new task initialize method which accepts a map of arguments.
Typically the source task at the start of the pipeline passes a
ReplicationState object through the pipeline in a map key called
replication.state (I think ... I'm not looking at the source code).  The
sink task then updates the state object with the current persisted state
during the initialize call, and by the time the initialize call returns,
the source task can use it to determine what replication point to start
from.
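
The handshake described above can be modelled roughly as follows. This is a toy Python model only, not the Osmosis Java API: the class names SourceTask, PassThroughTask and SinkTask are hypothetical, and the map key replication.state is taken from the recollection above rather than checked against the source code.

```python
class ReplicationState:
    """Mutable state object that travels through the pipeline."""

    def __init__(self):
        self.sequence_number = None  # unknown until the sink fills it in


class SinkTask:
    """Hypothetical sink that knows which sequence number it has persisted."""

    def __init__(self, persisted_sequence):
        self.persisted_sequence = persisted_sequence

    def initialize(self, metadata):
        state = metadata.get("replication.state")
        if state is not None:
            # The sink updates the shared object with its persisted state.
            state.sequence_number = self.persisted_sequence


class PassThroughTask:
    """Hypothetical middle task that must forward the initialize call."""

    def __init__(self, downstream):
        self.downstream = downstream

    def initialize(self, metadata):
        self.downstream.initialize(metadata)


class SourceTask:
    """Hypothetical source that decides where to start replicating."""

    def __init__(self, downstream):
        self.downstream = downstream

    def start(self):
        state = ReplicationState()
        self.downstream.initialize({"replication.state": state})
        # By the time initialize returns, the sink has filled in the state.
        if state.sequence_number is None:
            return 0
        return state.sequence_number + 1


pipeline = SourceTask(PassThroughTask(SinkTask(persisted_sequence=41)))
print(pipeline.start())
```

The point of the sketch is that every task between source and sink has to forward the initialize call, which is why tasks that do not yet do so would need updating.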

As part of that change I updated tasks such as --buffer to propagate the
initialize information properly across threads.  I believe other tasks such
as --merge will still need to be updated.

I doubt if I'll be able to provide much assistance in implementing this.  I
have another child due early in the New Year so I'll probably be off the
radar for a while :-)


Re: [OSM-dev] Timestamp in PBF files

2012-12-04 Thread Frederik Ramm

Hi,

On 12/04/12 07:50, Jochen Topf wrote:

That still isn't specific to Osmosis. Somebody else could implement this
algorithm. Markus seems to have done so, albeit a bit differently. The
algorithm should be documented somewhere and if you think there can be
other algorithms, maybe this one should get a name.


I'd call it the Osmosis algorithm, and therefore name the header 
fields osmosis_..., to make clear that they are intended for this 
algorithm. - I agree that it would be nice for the algorithm to be 
documented somewhere but I'm loath to make this a prerequisite for the 
proposed changes to OSM-Binary because it will unnecessarily delay the 
process.



But it should not be
named after one of the programs that happen to implement it.


That would then be pure coincidence.


This is a similar issue as with the main OSM map, which was named Mapnik
after the rendering program which led to no end of confusion.


Frankly, I don't care what it is called, I just want to get on with the 
show. Making up a new name for it now and telling everyone that this 
new name is what they've been using all the time is just as confusing 
but if anyone thinks this is important enough to spend the time to come 
up with a new name (or bug Brett to come up with one) then they're 
welcome to do so. Preferably within a couple of days.


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33



Re: [OSM-dev] Timestamp in PBF files

2012-12-04 Thread Pieren
On Tue, Dec 4, 2012 at 9:12 AM, Frederik Ramm frede...@remote.org wrote:

 Frankly, I don't care what it is called,

call it : a la Osmosis ;-)

Pieren



Re: [OSM-dev] Timestamp in PBF files

2012-12-04 Thread Scott Crosby
On Tue, Dec 4, 2012 at 2:12 AM, Frederik Ramm frede...@remote.org wrote:

 Hi,


 On 12/04/12 07:50, Jochen Topf wrote:

 That still isn't specific to Osmosis. Somebody else could implement this
 algorithm. Markus seems to have done so, albeit a bit differently. The
 algorithm should be documented somewhere and if you think there can be
 other algorithms, maybe this one should get a name.


 I'd call it the Osmosis algorithm, and therefore name the header fields
 osmosis_..., to make clear that they are intended for this algorithm. - I
 agree that it would be nice for the algorithm to be documented somewhere
 but I'm loath to make this a prerequisite for the proposed changes to
 OSM-Binary because it will unnecessarily delay the process.


I think I'm willing to call it a consensus. Thank you everyone for the
discussion.

And Frederik thank you for sending me a pull request. I've committed it
as-is. I'm happy to go with osmosis_* for the fieldnames. It's the osmosis
algorithm, and we can always add on other metadata fields in the future.
I've merged the pull request and I think I've added the right changelog,
maven, and debian glue for the new version and pushed the v1.3.0 tag to
github.

If I've missed anything, please send me a followup email.

Scott


Re: [OSM-dev] Timestamp in PBF files

2012-12-03 Thread Frederik Ramm

Hi,

On 22.11.2012 00:18, Scott Crosby wrote:

I think for Frederik's immediate needs, we should add a field
called osmosis_replication_timestamp or osmosis_replication_state = 32,
which contains a submessage containing a replication timestamp and other
replication data that he feels is appropriate.

As for the timestamp =18 field, Dennis, what was your intended use of
this field?  Marqqs, what is the intended use of your timestamp
optional_features field?


Since nobody has come forward with further requests, may I humbly 
suggest that we add three new fields:


a 64-bit integer osmosis_replication_timestamp for the replication 
timestamp, expressed in seconds since the epoch, otherwise the same 
value as in state.txt's timestamp=... field;


a 64-bit integer osmosis_replication_sequence_number for the 
replication sequence number (sequenceNumber=... in the state.txt file) 
which is, in practice, not required as Marqqs has explained but makes 
things easier for Osmosis, as Brett has explained;


a variable-length string osmosis_replication_base_url that points to 
the directory from where replication files are loaded (baseUrl=... in 
configuration.txt).
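
To illustrate how the three values relate to state.txt and configuration.txt, here is a minimal Python sketch. It assumes both files are simple key=value text in which colons may be escaped as "\:"; the sample contents and the ReplicationInfo class are made up for the example and are not part of the proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ReplicationInfo:
    # Attribute names mirror the three proposed header fields.
    osmosis_replication_timestamp: int        # seconds since the epoch (UTC)
    osmosis_replication_sequence_number: int
    osmosis_replication_base_url: str


def parse_properties(text):
    """Parse key=value lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        # Assumption: ':' may be escaped as '\:' in these files; undo that.
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props


def replication_info(state_txt, configuration_txt):
    state = parse_properties(state_txt)
    config = parse_properties(configuration_txt)
    stamp = datetime.strptime(state["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    epoch_seconds = int(stamp.replace(tzinfo=timezone.utc).timestamp())
    return ReplicationInfo(
        epoch_seconds, int(state["sequenceNumber"]), config["baseUrl"]
    )


# Sample file contents, invented for illustration.
STATE = "#example\nsequenceNumber=118578\ntimestamp=2012-12-03T20\\:54\\:00Z\n"
CONFIG = "baseUrl=http://planet.openstreetmap.org/replication/minute\n"

info = replication_info(STATE, CONFIG)
print(info)
```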


It may make sense to have a start timestamp and start replication number 
in there as well but I don't have an immediate use case so I'm happy to 
defer that until there is one.


I've sent you a pull request on GitHub for this change but I'd like to 
stress again that I wouldn't mind if it were done differently, with 
other fields, other types, other IDs - main thing for me is that you 
give it the nod and add it to your OSM-Binary repo which I consider to 
be the official one. Once the stuff is in there I can go on and make 
patches for programs that use PBF files in some way. (Not sure if I'll 
come as far as Osmosis but we'll see.)


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33



Re: [OSM-dev] Timestamp in PBF files

2012-12-03 Thread Jochen Topf
On Mon, Dec 03, 2012 at 09:54:59PM +0100, Frederik Ramm wrote:
 On 22.11.2012 00:18, Scott Crosby wrote:
 I think for Frederik's immediate needs, we should add a field
 called osmosis_replication_timestamp or osmosis_replication_state = 32,
 which contains a submessage containing a replication timestamp and other
 replication data that he feels is appropriate.
 
 As for the timestamp =18 field, Dennis, what was your intended use of
 this field?  Marqqs, what is the intended use of your timestamp
 optional_features field?
 
 Since nobody has come forward with further requests, may I humbly
 suggest that we add three new fields:
 
 a 64bit integer osmosis_replication_timestamp for the replication
 timestamp, expressed in seconds since the epoch, otherwise the same
 value as in state.txt's timestamp=... field;

Why the osmosis in there? That seems rather strange to me. Either it is some
general thing that works with all programs, then it shouldn't be named after a
specific program. Or it is not, then it shouldn't be in a general file standard.

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298



Re: [OSM-dev] Timestamp in PBF files

2012-12-03 Thread Frederik Ramm

Hi,

On 03.12.2012 22:27, Jochen Topf wrote:

Why the osmosis in there? That seems rather strange to me. Either it is some
general thing that works with all programs, then it shouldn't be named after a
specific program. Or it is not, then it shouldn't be in a general file standard.


It is the replication technology used by Osmosis on the server side. It 
works with all programs that use the Osmosis algorithm. It doesn't work 
with every thinkable replication mechanism because those might require 
other data. Trying to invent something future proof seldom works.


For example, the way the directories are structured below the 
replication URL 
(http://planet.openstreetmap.org/replication/minute/000/118/578.osc.gz) 
is something specific to the way Osmosis handles its replication; a 
program that consumes these files needs knowledge about that.
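
As a sketch of the knowledge a consumer needs, the following Python function maps a sequence number to a change file URL. The layout (sequence number zero-padded to nine digits and split into three groups of three) is inferred from the example URL above, where 000/118/578 would correspond to sequence number 118578; the function name is hypothetical.

```python
def change_file_url(base_url, sequence_number):
    """Build the .osc.gz URL for a sequence number, e.g. 118578 -> 000/118/578."""
    if not 0 <= sequence_number <= 999_999_999:
        raise ValueError("sequence number must fit in nine decimal digits")
    digits = f"{sequence_number:09d}"
    path = "/".join((digits[0:3], digits[3:6], digits[6:9]))
    return f"{base_url.rstrip('/')}/{path}.osc.gz"


url = change_file_url("http://planet.openstreetmap.org/replication/minute", 118578)
print(url)
```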


If you wanted to encode some kind of generic replication information 
then you'd probably boil it down to a simple string field called 
replication_information and that would then contain something like 
replication_type=osmosis sequence_number=1234 
url=http://something/replication/minute; or so.


That would be possible, but it would force every single writer/consumer 
of these files to serialize/deserialize the replication information 
string (tabs or spaces? spaces allowed after the equal sign or not? 
order significant? type=osmosis or type=Osmosis? ...) - making them 
top-level fields saves us from that.


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33



Re: [OSM-dev] Timestamp in PBF files

2012-12-03 Thread Jochen Topf
On Tue, Dec 04, 2012 at 12:09:16AM +0100, Frederik Ramm wrote:
 On 03.12.2012 22:27, Jochen Topf wrote:
 Why the osmosis in there? That seems rather strange to me. Either it is some
 general thing that works with all programs, then it shouldn't be named after a
 specific program. Or it is not, then it shouldn't be in a general file standard.
 
 It is the replication technology used by Osmosis on the server side.
 It works with all programs that use the Osmosis algorithm. It
 doesn't work with every thinkable replication mechanism because
 those might require other data. Trying to invent something future
 proof seldom works.
 
 For example, the way the directories are structured below the
 replication URL 
 (http://planet.openstreetmap.org/replication/minute/000/118/578.osc.gz)
 is something specific to the way Osmosis handles its replication; a
 program that consumes these files needs knowledge about that.
 
 If you wanted to encode some kind of generic replication information
 then you'd probably boil it down to a simple string field called
 replication_information and that would then contain something like
 replication_type=osmosis sequence_number=1234
 url=http://something/replication/minute; or so.
 
 That would be possible, but it would force every single
 writer/consumer of these files to serialize/deserialize the
 replication information string (tabs or spaces? spaces allowed after
 the equal sign or not? order significant? type=osmosis or
 type=Osmosis? ...) - making them top-level fields saves us from
 that.

That still isn't specific to Osmosis. Somebody else could implement this
algorithm. Markus seems to have done so, albeit a bit differently. The
algorithm should be documented somewhere and if you think there can be
other algorithms, maybe this one should get a name. But it should not be
named after one of the programs that happen to implement it.

This is a similar issue as with the main OSM map, which was named Mapnik
after the rendering program which led to no end of confusion.

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298



Re: [OSM-dev] Timestamp in PBF files

2012-11-24 Thread Brett Henderson
Hi Markus,

On 24 November 2012 00:04, mar...@gmx.eu wrote:

 Hi Brett,

  *If* this information is intended to be used as an input into replication
  processes then the sequence number is essential.  Osmosis writes a
  timestamp in the state.txt file, but it is only for identifying the right
  sequence number to begin replication with.  All replication processing
  requires the sequence number.  Attempting to use a timestamp is
  theoretically possible but it's much less efficient and not how it was
  supposed to work.

 I think this is true for database based updates, however the sequence
 number is not really needed for file based updates we're presently talking
 about:

 For example, osmupdate downloads all change files, starting with the
 newest, going back in time until the change file has been downloaded
 which is newer than the planet files timestamp. Then all these change files
 are merged to one big change file which is then applied to the planet file.


Yep, that will work for patching planet files.  The replication tasks in
Osmosis can't operate that way though.
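
For reference, the timestamp-driven selection described in the quoted text could look roughly like this in Python. This is one reading of that description, not osmupdate's actual code: it assumes each change file carries the timestamp of the end of the interval it covers, and the function name and sample data are invented.

```python
def select_change_files(change_files, planet_timestamp):
    """Pick the change files needed to bring a planet file up to date.

    change_files is a list of (sequence_number, timestamp) tuples sorted
    oldest first; timestamps are seconds since the epoch.
    """
    selected = []
    # Walk from the newest file back in time, as described above.
    for sequence_number, timestamp in reversed(change_files):
        if timestamp <= planet_timestamp:
            break  # everything from here on is already in the planet file
        selected.append((sequence_number, timestamp))
    selected.reverse()  # apply (or merge) oldest first
    return selected


available = [(1, 100), (2, 160), (3, 220), (4, 280)]
print(select_change_files(available, planet_timestamp=160))
```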

The existing --read-replication-interval allows limits to be specified to
restrict the amount of changesets downloaded at a time.  This allows a
local database to catch up in smaller steps if it is a long way behind.
Catching up in smaller steps is preferable in this case because it deals
better with the odd failure in processing (it's very frustrating to
download weeks of changes only to fail near the end and have to start
again), and because it prevents transaction sizes from growing unbounded.
Having to wait several days for one huge catchup transaction to be
processed is far from ideal, it's preferable to catch up in smaller steps.
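
Catching up in smaller steps is straightforward once sequence numbers are available. The Python sketch below splits the gap between a local and a remote sequence number into bounded batches; it counts files per batch purely for illustration and does not mirror the actual options of --read-replication-interval. The function name is hypothetical.

```python
def catch_up_batches(local_sequence, remote_sequence, max_files):
    """Yield inclusive (first, last) sequence ranges of at most max_files each."""
    if max_files < 1:
        raise ValueError("max_files must be at least 1")
    first = local_sequence + 1
    while first <= remote_sequence:
        last = min(first + max_files - 1, remote_sequence)
        yield first, last
        first = last + 1


for first, last in catch_up_batches(100, 107, max_files=3):
    # Each batch could be downloaded and committed in its own transaction.
    print(first, last)
```

If a batch fails, only that batch has to be repeated, and no single transaction grows beyond max_files change files.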

For patching planet files it's less of an issue because you'll almost
always want all available changes to be applied, and because the number of
files being downloaded will be much less (you'll typically be using daily
or hourly files, not minute files) therefore you'll be less likely to run
into an intermittent network connectivity problem, and patching a file is
extremely unlikely to throw errors unless you run out of disk space or have
a system crash.

One other thing worth mentioning is that timestamps are not guaranteed to
increase for every change file.  In practice for anything down to minute
files you're unlikely to see any issues, but if the database server clock
skews for any reason there's nothing to prevent time running backwards.
This could lead to consumers relying on timestamps to miss data.  Sequence
numbers on the other hand are guaranteed to always increase per change file.
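
A small Python illustration of the clock skew problem, using invented data: change file 3 was written after the clock jumped back, so its timestamp is lower than that of file 2. A consumer that has applied everything up to file 2 and selects by timestamp skips file 3, while selection by sequence number does not.

```python
# (sequence_number, timestamp) pairs; the values are made up for the example.
change_files = [(1, 1000), (2, 1060), (3, 1050), (4, 1120)]


def newer_by_timestamp(files, last_timestamp):
    """Sequence numbers of files whose timestamp is after last_timestamp."""
    return [seq for seq, ts in files if ts > last_timestamp]


def newer_by_sequence(files, last_sequence):
    """Sequence numbers of files after last_sequence."""
    return [seq for seq, ts in files if seq > last_sequence]


# The consumer has already applied file 2 (timestamp 1060).
print(newer_by_timestamp(change_files, 1060))
print(newer_by_sequence(change_files, 2))
```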

This is all a bit academic for patching planet files, but Osmosis doesn't
make any assumptions about how short the changeset intervals are, or what
is consuming changes at the other end of the pipeline.

I could create a new task optimised for patching planet files, and perhaps
that's what I (or somebody else if they wish to step in) will need to do if
we embed replication information into PBF files, but it will have to remain
separate from --read-replication-interval, so there'll be more code to
maintain.  I'm not opposed to it if it makes users' lives easier though.

In summary, I'd prefer to keep using sequence numbers if possible because
it allows me to re-use more existing replication code, but it wouldn't be
impossible to do without them.


 Osmosis may work differently, and it may need the sequence number to start
 this kind of file update - I really don't know. But if so, I totally agree,
 we should make it possible to store sequence numbers in PBF files.

 Could also be done with the key-val format I suggested...


Cool, I don't have any strong opinions on how the information should be
stored.  I'm happy to leave that in the hands of those more familiar with
the PBF format.

Brett


Re: [OSM-dev] Timestamp in PBF files

2012-11-24 Thread Scott Crosby
On Fri, Nov 23, 2012 at 5:03 AM, mar...@gmx.eu wrote:

 Hi Scott,

 in brief to the 1-degrees granularity:

 1. Do whole processing in 64 bit:
 This would mean to need much more RAM space when processing ways'
 coordinates. We should not do this unless this granularity is really
 required.


If you want your program to do all processing with 100 nanodegree
granularity instead of 1 nanodegree granularity, then you can use ints
throughout. Your software will have the limitation that if a PBF file
contains data with 1 nanodegree granularity that there will be data loss,
which is probably not a limitation in practice. AFAIK, there are no PBF
files with granularity that is not a multiple of 100 or with lat_offset and
lon_offset != 0.



 2. Your formula:
   latitude_int = ((lat_offset + granularity*lat)/50+1)/2
 Good idea, but again, this would mean one more multiplication, one more
 division (and two additions, one shift). These operations usually can be
 done in no time, however that's different if you need to do them a Billion
 times.


I'm curious, have you benchmarked the difference?

 There are still people out there who have 32 bit machines, I presume they
 do not have 64 bits hardware multiplication units, hence the processing
 time will increase.


In any case, if the file has a granularity that is a multiple of 100,  you
can use this specialized formula instead:
   latitude_int = (lat_offset/50+1)/2 + (granularity/100)*lat // This calculation can be done using 32-bit ints.

This can be further specialized for when the granularity is 100 to:
   latitude_int = (lat_offset/50+1)/2 + lat // This calculation can be done using 32-bit ints.
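
The formulas above can be checked against each other with a short Python sketch. It assumes lat_offset and granularity are given in nanodegrees and the result is in units of 100 nanodegrees; the function names are hypothetical. Note that Python's // floors towards negative infinity whereas C integer division truncates towards zero, so results for negative intermediate values may differ by one unit from a C implementation.

```python
def to_100nano_general(lat, granularity=100, lat_offset=0):
    """General formula: round (lat_offset + granularity*lat) to 100-nanodegree units."""
    return ((lat_offset + granularity * lat) // 50 + 1) // 2


def to_100nano_fast(lat, granularity=100, lat_offset=0):
    """Specialized formula, valid when granularity is a multiple of 100."""
    if granularity % 100 != 0:
        raise ValueError("granularity must be a multiple of 100")
    return (lat_offset // 50 + 1) // 2 + (granularity // 100) * lat


# 49.0025 degrees stored with the default granularity of 100 nanodegrees.
stored = 490025000
print(to_100nano_general(stored), to_100nano_fast(stored))
```

With floor division the two functions agree for every granularity that is a multiple of 100, because granularity*lat then contributes an exact multiple of 100 nanodegrees and passes through the rounding unchanged.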


 3. Process sequence:
 Using the granularity factor, lon/lat of every node in an OSMData
 fileblock must be read, stored temporarily and transformed later. Thus you
 have to access every data twice: first to read it, and a second time when
 you transform its granularity. This might be a flaw in PBF data model...
 Could we at least change this in that manner that the granularity
 information comes _before_ the real data? Same applies to lon/lat offset
 and date granularity.


No can do. Google's protobuf format doesn't specify the order in which
the components of a message are serialized (this is to support
concatenation of messages without decoding them). Their implementation
serializes in tag-order, and I chose larger numbers for the granularity
tags than for the primitive block tags.



 In the end - there always will be a lot of programs which do not need this
 quasi optional feature granularity and simply will not support it.



 Metadata...

 We had the same discussion a year ago. Do you remember?
 https://wiki.openstreetmap.org/wiki/Talk:PBF_Format#File_Timestamp.3F
 I'm curious if - and I hope that - we manage to extend the PBF data format
 this time. :-)


 The file time stamp I added was meant as an interim solution: I took the
 already defined optional feature and stored a key-val pair in it, for
 example timestamp=2011-10-16T15:45:00Z.

 I think this example shows what we really need: a flexible format for file
 related meta data. With key-val pairs, everyone could add optional data
 whenever they are needed in a toolchain. This is the flexibility we are
 used to have from OSM XML format.


I understand the desire for this, but I want to put some thought into it to
avoid the situation that created this thread, where the same metadata is
stored in different locations, and in different formats.

How about two types of metadata storage, one type is standardized in the
OSMHeader object directly:


message HeaderBlock {
  optional HeaderBBox bbox = 1;
  /* Additional tags to aid in parsing this dataset */
  repeated string required_features = 4;
  repeated string optional_features = 5;
  /* Other ad-hoc metadata */
  repeated AdHocMetadata adhoc_metadata = 6; // See below.

  optional string writingprogram = 16;
  optional string source = 17; // From the bbox field.
  optional string timestamp = 18; // from OSM planet header.
  optional int64 replication_timestamp = 19; // In microseconds since 1970 UTC.
  optional string copyright = 20;
  optional string contributors = 21;
  optional string license = 22;
}


(new fields taken from the new planet header). Question, since I haven't
reviewed OSM replication options, do we want one timestamp, two timestamps,
and should they be int64 or string?


 To combine this flexibility with the advantages of Protobuf format
 (compressed storage of different data types) we need to allow meta
 formatted objects - or something like this:

 message HeaderBlock {
   ...
   repeated HeaderMeta = 20;
 }

 message HeaderMeta {
   required string HeaderKey = 1;
   optional HeaderMetaVarint = 10;
   optional HeaderMetaString = 12;
 // see type definitions there:
 https://wiki.openstreetmap.org/wiki/PBF#Format_example
 // Only _one_ of the three optional objects should be used; did not know
 how to define this in Protobuf without 

Re: [OSM-dev] Timestamp in PBF files

2012-11-23 Thread Paul Norman
 From: Jochen Topf [mailto:joc...@remote.org]
 Sent: Thursday, November 22, 2012 8:19 AM
 Subject: Re: [OSM-dev] Timestamp in PBF files
 
 I don't know why there are no redacted nodes, Matt mentioned something
 that he hasn't implemented that yet. But that would mean we have non-
 ODbL-clean data in the full history dump. Frankly this gets all a bit
 too confusing for me. I hope the people who have implemented these
 things will at some point document them and/or fix those cases.

It's also possible that the redacted nodes aren't included in the dump at
all.

Could you check for version 1 of node 551550983? It's a random redacted
node.

If it's present with positional information then the file isn't ODbL clean
but if it's completely missing then it's a documentation issue.




Re: [OSM-dev] Timestamp in PBF files

2012-11-23 Thread marqqs
Hi Scott,

in brief to the 1-degrees granularity:

1. Do whole processing in 64 bit:
This would require much more RAM when processing ways' coordinates. 
We should not do this unless this granularity is really required.

2. Your formula:
  latitude_int = ((lat_offset + granularity*lat)/50+1)/2
Good idea, but again, this would mean one more multiplication, one more 
division (and two additions, one shift). These operations usually can be done 
in no time, however that's different if you need to do them a Billion times.
There are still people out there who have 32 bit machines, I presume they do 
not have 64 bits hardware multiplication units, hence the processing time will 
increase.

3. Process sequence:
Using the granularity factor, lon/lat of every node in an OSMData fileblock 
must be read, stored temporarily and transformed later. Thus you have to access 
every value twice: first to read it, and a second time when you transform its 
granularity. This might be a flaw in the PBF data model... Could we at least 
change this so that the granularity information comes _before_ the real 
data? The same applies to lon/lat offset and date granularity.

In the end - there always will be a lot of programs which do not need this 
quasi optional feature granularity and simply will not support it.


Metadata...

We had the same discussion a year ago. Do you remember?
https://wiki.openstreetmap.org/wiki/Talk:PBF_Format#File_Timestamp.3F
I'm curious if - and I hope that - we manage to extend the PBF data format this 
time. :-)

The file time stamp I added was meant as an interim solution: I took the 
already defined optional feature and stored a key-val pair in it, for example 
timestamp=2011-10-16T15:45:00Z.
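
A reader that wants to pick up this interim timestamp could scan the optional feature strings along these lines. This is a minimal Python sketch: the function name is hypothetical, "SomeOtherFeature" is a placeholder entry, and the list stands in for the optional_features strings of an already decoded header block.

```python
from datetime import datetime, timezone


def file_timestamp(optional_features):
    """Return the timestamp=... entry as an aware datetime, or None if absent."""
    for feature in optional_features:
        key, sep, value = feature.partition("=")
        if sep and key == "timestamp":
            parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
            return parsed.replace(tzinfo=timezone.utc)
    return None


features = ["SomeOtherFeature", "timestamp=2011-10-16T15:45:00Z"]
print(file_timestamp(features))
```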

I think this example shows what we really need: a flexible format for 
file-related metadata. With key-val pairs, everyone could add optional data 
whenever it is needed in a toolchain. This is the flexibility we are used to 
from the OSM XML format.

To combine this flexibility with the advantages of Protobuf format (compressed 
storage of different data types) we need to allow meta formatted objects - or 
something like this:

message HeaderBlock {
  ...
  repeated HeaderMeta = 20;
}

message HeaderMeta {
  required string HeaderKey = 1;
  optional HeaderMetaVarint = 10;
  optional HeaderMetaString = 12;
// see type definitions there:
// https://wiki.openstreetmap.org/wiki/PBF#Format_example
// Only _one_ of the three optional objects should be used; did not know how to
// define this in Protobuf without an additional hierarchy layer.
}

What do you think about this suggestion?

Markus



Re: [OSM-dev] Timestamp in PBF files

2012-11-23 Thread marqqs
 message HeaderMeta {
   required string HeaderKey = 1;
   optional HeaderMetaVarint = 10;
   optional HeaderMetaString = 12;
 // see type definitions there:
 https://wiki.openstreetmap.org/wiki/PBF#Format_example
 // Only _one_ of the three optional objects should be used; did not know
 how to define this in Protobuf without an additional hierarchy layer.
 }


Sorry, I meant "Only _one_ of the two", not three.



Re: [OSM-dev] Timestamp in PBF files

2012-11-23 Thread Brett Henderson
On 21 November 2012 19:43, Frederik Ramm frede...@remote.org wrote:

 Hi,

<snip>


 To be self-contained, it should be sufficient to include the baseURL
 from configuration.txt, no?

 So maybe:


   optional string writingprogram = 16;
   optional string source = 17;
   optional sint64 timestamp = 18;
   optional sint64 replication_timestamp = 19;
   optional string replication_url = 20;

 I don't know if the sequenceNumber from state.txt adds any value, if it
 does then one could throw that in as well.


I've been explicitly cc'd on the original message so I should put in an
appearance ;-)

*If* this information is intended to be used as an input into replication
processes then the sequence number is essential.  Osmosis writes a
timestamp in the state.txt file, but it is only for identifying the right
sequence number to begin replication with.  All replication processing
requires the sequence number.  Attempting to use a timestamp is
theoretically possible but it's much less efficient and not how it was
supposed to work.

However, utilising this new sequence number in Osmosis will require some
significant changes.  The current task that figures out what changes to
download (ie. --read-replication-interval) is totally independent of the
task that applies changes to a snapshot (ie. --apply-change).  The simplest
solution would be to write an uber task that is specifically aimed at
patching planet files, but it will be an all-in-one task that can't be
combined with others.  It *may* be possible to modify pipeline
initialisation to allow all tasks to synchronise replication numbers before
beginning processing, but that will be a lot more complicated.

Updating the timestamp and sequence number after processing will also
require some changes because it impacts a number of tasks.  All tasks will
have to propagate the field (shouldn't be too difficult), but tasks such as
--apply-change will need to be smart about which input source they use as
the source of truth for the sequence number.  It's all possible, but not a
trivial change.

Perhaps this is a non-issue if everybody uses osmupdate these days anyway
:-)

As for the PBF format itself, I don't have any opinions.  I'm more than
happy for those who are more familiar with it to come up with a solution.
I'll do my best to accommodate it.

Brett


Re: [OSM-dev] Timestamp in PBF files

2012-11-22 Thread Jochen Topf
On Wed, Nov 21, 2012 at 05:16:12PM -0600, Scott Crosby wrote:
 On Wed, Nov 21, 2012 at 3:46 AM, Jochen Topf joc...@remote.org wrote:
 
  On Tue, Nov 20, 2012 at 09:17:59PM -0600, Scott Crosby wrote:
   Not quite. The granularity of timestamps can go down to the milliseconds.
  
  
  https://github.com/DennisOSRM/OSM-binary/blob/master/src/osmformat.proto#L96
 
  Ugh. Yes. That was always somewhat of a problem in the protocol IMHO.
  Nobody
  needs more granularity than seconds because the main database doesn't have
  it.
  Similar for the latitude/longitude granularity. Nobody uses that. And it
  just
  makes all the code reading PBF files a bit more complex and a bit slower.
 
 
 Today the database lacks those features, but the future can be different.
 The trivial complexity of that feature in readers allows many possible
 future features, without a breaking format change. The ones I had in mind
 were:
 
 Lower granularity makes it easy to create lower-precision excerpts that
 are smaller to send and easier to store.
 Allow OSM tooling to handle contour lines, or other grid-specified
 data, where making the granularity size matching the grid size can lead to
 vastly improved compression.
 Support future higher-precision data, e.g., generated from GPS block
 3 satellites.
 Millisecond timestamps are much easier to use as unique changeset ID's
 than second-granularity timestamps.

On the other hand it is rather unlikely that OSM will make those changes to its
database anytime soon or that PBF is used for non-OSM data like contour lines
(because there are better formats and tools for that). Having functionality
that nobody actually uses means it is probably not implemented universally and
properly (Markus already mentioned he doesn't implement them). In the best
case software that doesn't implement it at least checks for it and complains,
in the worst case there is some buggy code that never gets checked because
nobody ever uses it so that if and when we actually use those features we
can't rely on the software anyway. And we have changed the PBF format before
and are in the process of changing it again, so it is not such a big deal to
add support for these things later if they are actually needed.

Oh well, this is rather academic, because I am not proposing we change the
format now. I'd only do that if we have a larger overhaul of the format.

 The runtime cost of this is a couple of multiplications that loop-invariant
 code motion can remove; about 30 nanoseconds for each 8000 entity block,
 and is much much cheaper than the branch prediction failures of VarInt
 decoding.

I use ints internally in Osmium for the lon/lat as does PBF. But there is this
conversion in there and depending on the granularity factor I am not sure I can
actually do that using just integers. I don't want to use doubles though. So
this might break on some granularity factors, I don't know and I never tested
it.  I actually use an int to double conversion before the factor is applied and
later convert back to int. And in the usual case for OSM I don't do this double
conversion at all, I just use the int as is because it has the right
granularity factor anyway. This extra check (one if that can be perfectly
branch predicted because it never changes) makes the reading of the whole PBF
file about 1% faster! double/int-conversions are slow. So even this seemingly
small thing means I spent too much time thinking about it and writing code I am
not sure is perfectly right. :-(

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298



Re: [OSM-dev] Timestamp in PBF files

2012-11-22 Thread Dennis Luxen
 As for the timestamp =18 field, Dennis, what was your intended use
 of this field?  Marqqs, what is the intended use of your timestamp 
 optional_features field?
 
 By this, I mean, what semantics are you attaching to these
 timestamps. I think its perfectly reasonable to have several
 timestamp fields, perhaps: The timestamp the file was generated. The
 state needed to resume replication of an extract/planet (which 
 contains an internal timestamp)? The timestamp of the when the file
 was extracted/excerpted?

The main purpose was to store the replication state.

--Dennis

 If you two could give me a better idea of what your timestamps are
 used for, I could advise on how we can try to integrate them into one
 or more standard timestamp fields. And after that, we can then figure
 out how we might want to assign timestamps to field names/ids ---
 keeping in mind prior uses of those field names and numbers.
 





Re: [OSM-dev] Timestamp in PBF files

2012-11-22 Thread marqqs




Hello Scott,

Thanks for your reply! I think what we need is a replacement for the timestamp 
which has been provided by .osm.bz2 files for years now. For example:


$ wget -q planet.openstreetmap.org/planet/planet-latest.osm.bz2 -O - | bunzip2 | head -4

<?xml version="1.0" encoding="UTF-8"?>
<osm version="0.6" generator="OpenStreetMap planet.c" copyright="OpenStreetMap and contributors" attribution="http://www.openstreetmap.org/copyright/" license="http://opendatacommons.org/licenses/odbl/1.0/" timestamp="2012-11-14T01:10:07Z">
  <bound box="-90,-180,90,180" origin="http://www.openstreetmap.org/api/0.6" />
  <node id="3" lat="50.1240327" lon="14.4524155" timestamp="2012-07-24T12:48:39Z" version="7" changeset="12465837" user="OSMF Redaction Account" uid="722137"/>


As you can see, there is a timestamp=2012-11-14T01:10:07Z which states the 
replication time - as far as I know.

The PBF formatted planet file from the same day lacks this information. Thus,
people who need this timestamp cannot use the PBF planet but are forced to 
download the old bzipped XML planet file (or to look for a suitable state.txt).

My goal - and presumably Frederik's as well - is to eliminate this disadvantage 
of PBF formatted files.

I have no objections to code this timestamp as signed Varint with id 32. This 
should result in two bytes (0x80 0x02) when PBF-coded.
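
The two bytes are the protobuf field key for id 32 with wire type 0 (varint). A small Python sketch of that arithmetic follows; the helper names are made up for illustration. The ZigZag mapping is only relevant if the field ends up being declared as sint64, which is an assumption here.

```python
def encode_varint(value):
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("varint input must be non-negative")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def field_key(field_number, wire_type):
    """Protobuf field key: (field_number << 3) | wire_type, as a varint."""
    return encode_varint((field_number << 3) | wire_type)

def zigzag64(n):
    """ZigZag mapping used by sint64: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return (n << 1) ^ (n >> 63)

print(field_key(32, 0).hex())  # key bytes for id 32, wire type 0
print(field_key(18, 0).hex())  # key bytes for id 18, wire type 0
```

Any id of 16 or above needs a two-byte key, so ids 18 and 32 cost the same on the wire.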

May I add this to the OSM Wiki page?

Markus



Re: [OSM-dev] Timestamp in PBF files

2012-11-22 Thread Jochen Topf
On Wed, Nov 21, 2012 at 07:00:32PM +0100, Frederik Ramm wrote:
 On 11/21/12 18:46, Jochen Topf wrote:
 On Tue, Nov 20, 2012 at 09:17:59PM -0600, Scott Crosby wrote:
 How many nodes in the planet lack a latitude or longitude? Using a MAXINT
 encoding will cost about 8 bytes for each missing latitude or longitude.
  It's possible to reduce this to 2-3 bytes, but the format gets
 uglier/hackier. IMHO, probably not worth that cost.
 
 I just counted those cases. In the history dump from October 2012 there are
 2344 nodes without coordinates. Hardly worth thinking about...
 
 That sounds implausibly low.
 
 Given that
 
 1. every deleted node should be in that file without coordinates
 2. we're currently at node id 2.03 billion,
 3. there are 1.66 billion visible nodes in the database
 
 we should have something like 370 million deleted nodes.
 
 Hm, we probably have to remove from that number those nodes that
 were deleted in ancient times where we've meanwhile dropped the
 history, and maybe some from the first TIGER import where we
 manually removed them from the database, but still - at least every
 node deleted in the past couple of years *should* show up with
 visible=false in the full history dump, and any node with
 visible=false *should* not have coordinates.
 
 Either there's an error in my thinking, or in your count, or in the
 script that does the history export ;)

I checked this in some more detail. The cases I found were cases from years ago
(last is from May 2008). Apparently the OSM server did not check coordinates
for validity back then. So all these nodes were in the database and lat and/or
lon happened to have the MAXINT value I use to signify undefined coordinates.
Of course they should never have had those values, but they did. So these cases
are not the redacted node coordinates.

I don't know why there are no redacted nodes; Matt mentioned that he hasn't
implemented that yet. But that would mean we have non-ODbL-clean data in
the full history dump. Frankly, this is all getting a bit too confusing for me. I
hope the people who have implemented these things will at some point document
them and/or fix those cases.

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298



Re: [OSM-dev] Timestamp in PBF files

2012-11-22 Thread Scott Crosby

 And we have changed the PBF format before

and are in the process of changing it again, so it is not such a big deal to
 add support for these things later if they are actually needed.


One of my goals was to reduce breaking changes, or making files that a
program thinks it can read, but can't actually read. (e.g., history dumps)



 I use ints internally in Osmium for the lon/lat as does PBF. But there is
 this
 conversion in there and depending on the granularity factor I am not sure
 I can
 actually do that using just integers. I don't want to use doubles though.


All units in PBF are in nano-degrees, so you can always use longs to do
your calculation, as long as you do the right casts so that the arithmetic
is done in longs instead of possibly overflowing ints.

So
 this might break on some granularity factors, I don't know and I never
 tested
 it.  I actually use a int to double conversion before the factor is
 applied and
 later convert back to int. And in the usual case for OSM I don't do this
 double
 conversion at all, I just use the int as is because it has the right
 granularity factor anyway. This extra check (one if that can be perfectly
 branch predicted because it never changes) makes the reading of the whole
 PBF
 file about 1% faster! double/int-conversions are slow. So even this
 seemingly
 small thing mean I spent too much time thinking about it and writing code
 I am
 not sure is perfectly right. :-(


Reading a PBF file into code that uses 32-bit integers to represent
latitudes and longitudes is probably safe on all current PBF files, but is
a potentially lossy operation; a latitude in a 32-bit integer is only
precise to 100 nanodegrees. If the PBF file happens to have measurements
precise to 1 nanodegree, you must lose 2 digits of precision.

Here is an alternate formula that only requires integer arithmetic that
will go from a PBF file to a 32-bit integer and is correct for any
granularity.

  long lat =  // Latitude encoded in the pbf. type must be a 64-bit
              // int to avoid overflow in calculation.
  latitude_int = ((lat_offset + granularity*lat)/50+1)/2  // This
              // calculation must be done with 64-bit longs.

This formula will be correct for any granularity and lat_offset . The
reason for the $/50+1)/2$ instead of $/100$ is to get better round-off
behavior; it'll round-nearest instead of round-to-zero.
http://en.wikipedia.org/wiki/Rounding

If the granularity is 100, or any multiple of 100 (e.g., 200, 1000, 1,
700), you can simplify the above formula into:
  int lat =  // This can be an 32-bit int without overflow.
  latitude_int = (lat_offset/50+1)/2 + (granularity/100)*lat  // This
              // calculation can be done using 32-bit ints.
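
For anyone who wants to try the two formulas, here is a sketch in Python. The function names and default arguments are made up for illustration. One caveat that comes from the language and not from the formulas above: Python's // floors, so the /50+1)/2 trick rounds to nearest for negative values too, whereas C or Java integer division truncates toward zero and can round negative inputs differently.

```python
def to_100_nanodegree_units(raw, granularity=100, offset=0):
    """General formula: works for any granularity and offset.

    raw is the integer stored in the PBF; the result is in units of
    100 nanodegrees, rounded to nearest. Python ints do not overflow,
    so the 64-bit requirement is met automatically here.
    """
    nanodegrees = offset + granularity * raw
    return (nanodegrees // 50 + 1) // 2

def to_100_nanodegree_units_fast(raw, granularity=100, offset=0):
    """Simplified formula, valid only when granularity is a multiple of 100."""
    if granularity % 100 != 0:
        raise ValueError("granularity must be a multiple of 100")
    return (offset // 50 + 1) // 2 + (granularity // 100) * raw

# With the standard granularity the stored value is used as is.
print(to_100_nanodegree_units(501240327))
# Both variants agree whenever the simplified one applies.
print(to_100_nanodegree_units(-7, 200, 130), to_100_nanodegree_units_fast(-7, 200, 130))
```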

I don't want to put these formulas as part of the spec as they are the
least-lossy approximations of the lossless formulas in the specification.

Scott


Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Frederik Ramm

Hi,

On 11/21/2012 04:27 AM, Scott Crosby wrote:

Idea, why not put the entire state.txt file into the OSMHeader block?


I tend to view the structure of state.txt as an Osmosis implementation 
detail and I'm not sure if it would be a good idea to require that PBF 
parsers not only decipher the PBF, but also have knowledge about how 
Osmosis builds its state.txt files.



One thing I don't like about it is that the state.txt file is not
self-contained:

#Tue Nov 20 19:02:18 UTC 2012
sequenceNumber=1668
timestamp=2012-11-20T19\:00\:00Z

It should have a planet URI (or a planet URI and a list of mirrors) of what 
planet it corresponds to. That way a user merely needs to say 'update planet' 
and everything else can be automated.
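
For reference, reading the three lines quoted above takes very little code. This Python sketch handles only what appears in that sample (comment lines starting with #, key=value pairs, and the backslash-escaped colons); it is not a full parser for the Java properties format, and the function name is made up.

```python
from datetime import datetime, timezone

def parse_state_txt(text):
    """Parse the subset of the state.txt syntax seen in the sample above."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the leading comment line
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip().replace("\\:", ":")
    return props

SAMPLE = r"""#Tue Nov 20 19:02:18 UTC 2012
sequenceNumber=1668
timestamp=2012-11-20T19\:00\:00Z
"""

state = parse_state_txt(SAMPLE)
sequence_number = int(state["sequenceNumber"])
replication_time = datetime.strptime(
    state["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)
print(sequence_number, replication_time.isoformat())
```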


To be self-contained, it should be sufficient to include the baseURL 
from configuration.txt, no?


So maybe:

  optional string writingprogram = 16;
  optional string source = 17;
  optional sint64 timestamp = 18;
  optional sint64 replication_timestamp = 19;
  optional string replication_url = 20;

I don't know if the sequenceNumber from state.txt adds any value, if it 
does then one could throw that in as well.


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33



Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Gregory Williams
 -Original Message-
 From: Frederik Ramm [mailto:frede...@remote.org]
 Sent: 21 November 2012 08:44
 To: Scott Crosby
 Cc: dev@openstreetmap.org
 Subject: Re: [OSM-dev] Timestamp in PBF files
 
[Snip]
 
 To be self-contained, it should be sufficient to include the baseURL
 from configuration.txt, no?
 
 So maybe:
 
optional string writingprogram = 16;
optional string source = 17;
optional sint64 timestamp = 18;
optional sint64 replication_timestamp = 19;
optional string replication_url = 20;
 
 I don't know if the sequenceNumber from state.txt adds any value, if it
does
 then one could throw that in as well.

I think including the sequenceNumber will be useful for making it easy to
determine where to continue replication from once the PBF file is processed.
Just to clarify that the replication_url will need to include the minute /
hour / day as appropriate for the sequenceNumber to apply to the appropriate
sequence, i.e. from the configuration.txt like you say.
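
To illustrate why the pair of URL and sequence number is enough to continue replication, here is a Python sketch that maps a sequence number to the files on the server. The function name is made up; the three-level directory layout matches the hourly state file URL for sequence 1668 on planet.openstreetmap.org, while the .osc.gz name for the change file is an assumption about the server layout that is not spelled out in this thread.

```python
def replication_paths(base_url, sequence_number):
    """Return (state_url, change_url) for one replication sequence number.

    The number is zero-padded to nine digits and split into three groups,
    e.g. 1668 -> 000/001/668.
    """
    digits = f"{sequence_number:09d}"
    stem = "/".join([base_url.rstrip("/"), digits[0:3], digits[3:6], digits[6:9]])
    return stem + ".state.txt", stem + ".osc.gz"

state_url, change_url = replication_paths(
    "http://planet.openstreetmap.org/replication/hour", 1668
)
print(state_url)
print(change_url)
```

A reader that finds sequence number N in the header would start fetching at N + 1.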

Gregory




Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Jochen Topf
On Tue, Nov 20, 2012 at 09:17:59PM -0600, Scott Crosby wrote:
 Not quite. The granularity of timestamps can go down to the milliseconds.
 
 https://github.com/DennisOSRM/OSM-binary/blob/master/src/osmformat.proto#L96

Ugh. Yes. That was always somewhat of a problem in the protocol IMHO. Nobody
needs more granularity than seconds because the main database doesn't have it.
Similar for the latitude/longitude granularity. Nobody uses that. And it just
makes all the code reading PBF files a bit more complex and a bit slower.

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298



Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Jochen Topf
On Tue, Nov 20, 2012 at 08:40:39PM +0100, Frederik Ramm wrote:
 On 20.11.2012 20:12, Jochen Topf wrote:
 I guess the timestamp is somehow supposed to say which state of the OSM
 database this file represents.
 
 Yes. Basically whatever was in Osmosis' state.txt file at the time
 this file was created.

That's not a definition. I create PBF files all the time without a state.txt
file around. 

 How is it supposed to work in history files?
 
 I think it would make sense to have a comparable timestamp in
 history files. Currently there's no software that would be able to
 patch history files with freshly downloaded diffs so the discussion
 is rather academic though.

Sure. Osmium can do that.

 Do we need two timestamps to define a range for history files?
 
 I'd suggest to wait until someone has an application that needs
 this. I'm a bit wary of throwing this discussion wide open because
 before too long we'll have all sorts of people suggesting helpful
 optional enhancements to the PBF format (while we're at it, can we
 maybe do X) and then nothing gets done again.
 
 All *I* want is one extra timestamp, and I would start using it
 tomorrow, it's not academic, there's software that would process it,
 there's a clear benefit to users. I'd prefer to use a standard but
 in the absence of an existing standard I'll just make something up
 and use that. (But we've seen how well that works - I did make
 something up for testing and Marqqs expected something else.)

And what I am saying is that we should think this through so that we don't
have the same problem again tomorrow.

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298



Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Frederik Ramm

Hi,

On 11/21/2012 10:42 AM, Jochen Topf wrote:

Yes. Basically whatever was in Osmosis' state.txt file at the time
this file was created.


That's not a definition. I create PBF files all the time without a state.txt
file around.


Then copy the timestamp from the input PBF, or if you don't have an 
input PBF or that doesn't have a timestamp, leave it out. The timestamp 
I'm after is not some generic timestamp that you can make up, it must 
always refer to a replication process. No replication process - no 
timestamp. Therefore it is probably a good idea to make that clear in 
the field name - not timestamp but replication_timestamp or so.



And what I am saying is that we should think this through so that we don't
have the same problem again tomorrow.


Then please think it through quickly and post the results ;)

Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33



Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread marqqs
Hello,

 How many nodes in the planet lack a latitude or longitude? Using a MAXINT
 encoding will cost about 8 bytes for each missing latitude or longitude.
  It's possible to reduce this to 2-3 bytes, but the format gets
 uglier/hackier. IMHO, probably not worth that cost.

As far as I understood, only nodes with the attribute action=delete do not have 
(resp. do not need) lon/lat. On the other hand, it does not hurt to give them 
false lon/lat values. This is what osmconvert does when you apply the 
--fake-lonlat option.

In PBF, lon/lat are delta coded, aren't they? Thus it would be best to write a 
delta of 0, i.e., to take the logical value of the previous node. A few steps 
later in the toolchain lon/lat values of action=delete objects will be deleted 
anyway (together with their objects).

 It should have a planet URI (or a planet URI and a list of mirrors) of
 what planet it corresponds to. That way a user merely needs to say
 'update planet' and everything else can be automated.

Please don't. These data aren't necessary. Same applies to sequence numbers.

For a year or so now, planet files can be updated by a single update command.
This command first determines the age of the old file, then it downloads all 
needed planet change files, starting with the newest and ending with that 
change file which has been published right after the file timestamp of the old 
planet file.

Syntax:
https://wiki.openstreetmap.org/wiki/Osmupdate#Updating_OSM_Files

Since the state.txt files from osm planet's server have to be parsed in the 
process anyway, there is no need to include them into PBF.

 No status, but if anyone wants my opinion, when authoring the format, I
 expected us to add metadata to planets, and expected it to be put into
 OSMHeader as in the OSRM clone you linked to above. I would vote to
 deprecate the use of the ISO timestring encoded into the optional_features
 array, but continue to write to it to avoid breaking old installs of
 Marqqs's tools.

OK, this seems to be consensual: PBF id 18 in the header block for a signed int 
UNIX timestamp value.

I will implement the appropriate read function in osmconvert at once.

For reason of compatibility osmconvert will _write_ both file timestamp 
representations, the UNIX based _and_ the string based. There may be some tools 
which depend on the format we have used for a year now.

 Ugh. Yes. That was always somewhat of a problem in the protocol IMHO.
 Nobody
 needs more granularity than seconds because the main database doesn't have
 it.
 Similar for the latitude/longitude granularity. Nobody uses that. And it
 just
 makes all the code reading PBF files a bit more complex and a bit slower.

I totally agree. osmconvert cannot even read any PBF files which do not use
standard granularity. It rejects these files with an error message. No one has 
ever complained! Thus I guess nobody really needs this option.

Besides, the format definition we have is kind of unfortunate: the granularity 
values may come _after_ the lon/lat values they refer to. This makes it 
necessary to process all the data in a data block twice: first parse it and - in
a second run - apply the granularity factor.

  And what I am saying is that we should think this through so that we
 don't
  have the same problem again tomorrow.
 
 Then please think it through quickly and post the results ;)

Done. Any objections? ;-)

Markus



Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Frederik Ramm

Hi,

On 11/21/2012 11:50 AM, mar...@gmx.eu wrote:

It should have a planet URI (or a planet URI and a list of mirrors) of
what planet it corresponds to. That way a user merely needs to say
'update planet' and everything else can be automated.


Please don't. These data aren't necessary. Same applies to sequence numbers.

Since a year or so planet files can be updated by a single update command. 
This command first determines the age of the old file, then it downloads all needed 
planet change files,


... by making hard-coded assumptions about where to get them from, and 
that's precisely what Scott meant. After all, there might be other 
projects using the OSM toolchain (e.g. fosm.org) and they publish their 
own diffs and might publish their own PBF files, and if you use their 
PBF file and try to update that from some openstreetmap.org URL that 
won't work.


So if you really want to be able to update the file without relying on some
out-of-band knowledge (I downloaded this file from Geofabrik and I 
happen to know that they use openstreetmap.org as their data source), 
then you would need the URI in the file.


The same if someone were to operate an OSM mirror and publish their own 
diffs, and you might choose to synchronize with them rather than with 
openstreetmap.org - even here, what the mirror publishes in a diff with 
the time stamp X is not necessarily identical with what OSM publishes in 
a diff with the same time stamp, and knowledge about where to update the 
file from would be essential.


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33



Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Frederik Ramm

Hi,

On 11/21/2012 11:50 AM, mar...@gmx.eu wrote:

OK, this seems to be consensual: PBF id 18 in the header block for a signed int 
UNIX timestamp value.


In both his messages, Scott had suggested PBF id 18 for a signed int 
epoch value of the file creation, not for a signed int epoch value of 
the replication state.


It would probably be premature to call this a consensus for a 
replication state timestamp at PBF id 18.


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33



Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Jochen Topf
On Wed, Nov 21, 2012 at 11:50:38AM +0100, mar...@gmx.eu wrote:
  How many nodes in the planet lack a latitude or longitude? Using a MAXINT
  encoding will cost about 8 bytes for each missing latitude or longitude.
   It's possible to reduce this to 2-3 bytes, but the format gets
  uglier/hackier. IMHO, probably not worth that cost.
 
 As far as I understood, only nodes with the attribute action=delete do not 
 have (resp. do not need) lon/lat. On the other hand, it does not hurt to give 
 them false lon/lat values. This is what osmconvert does when you apply the 
 --fake-lonlat option.
 
 In PBF, lon/lat are delta coded, aren't they? Thus it would be best to write 
 a delta of 0, i.e., to take the logical value of the previous node. A few 
 steps later in the toolchain lon/lat values of action=delete objects will be 
 deleted anyway (together with their objects).

You only have missing lon/lat in OSM files with history. And presumably you
use them because you want to know when what objects were created and deleted
and so on. So you can not just ignore deleted objects. And you want to know
whether an object had no lon/lat as compared to the lon/lat of the object
that happened to be right before it in the file. So your solution doesn't
work.
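
The ambiguity can be shown with a few lines of Python. This is only an illustration of delta coding in general, with made-up function names and values; it is not the DenseNodes decoding code of any tool mentioned here.

```python
from itertools import accumulate

def delta_encode(values):
    """Store each value as the difference to the previous one (first vs. 0)."""
    deltas = []
    previous = 0
    for value in values:
        deltas.append(value - previous)
        previous = value
    return deltas

def delta_decode(deltas):
    """Running sum restores the original values."""
    return list(accumulate(deltas))

# Two real nodes that happen to share the same latitude value ...
two_real_nodes = delta_encode([501240327, 501240327])
# ... produce the same deltas as a real node followed by a deleted node
# written with a delta of 0.
real_then_deleted = [501240327, 0]

print(two_real_nodes == real_then_deleted)
print(delta_decode(real_then_deleted))
```

After decoding, nothing in the coordinate stream tells a reader which of the two cases it is looking at, so the missing coordinate has to be signalled some other way.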

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298



Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread marqqs
Frederik, Jochen,

sorry, you both are right, I really was too fast.

But now?

Please, let's risk one small step and standardize the file timestamp 
(replication time), whatever the protobuf ID will be. If not 18, then 19 or 
something else. Protobuf format is flexible enough to be extended again at any 
time.

After this, we can continue caring about other file related meta data.

Furthermore, we can think about introducing a new (or extended) dense node 
format for history files.

Step by step...

Markus



Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Jochen Topf
On Tue, Nov 20, 2012 at 09:17:59PM -0600, Scott Crosby wrote:
 How many nodes in the planet lack a latitude or longitude? Using a MAXINT
 encoding will cost about 8 bytes for each missing latitude or longitude.
  It's possible to reduce this to 2-3 bytes, but the format gets
 uglier/hackier. IMHO, probably not worth that cost.

I just counted those cases. In the history dump from October 2012 there are
2344 nodes without coordinates. Hardly worth thinking about...

Maybe we should just remove them altogether?

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298



Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Frederik Ramm

Hi,

On 11/21/12 18:46, Jochen Topf wrote:

On Tue, Nov 20, 2012 at 09:17:59PM -0600, Scott Crosby wrote:

How many nodes in the planet lack a latitude or longitude? Using a MAXINT
encoding will cost about 8 bytes for each missing latitude or longitude.
 It's possible to reduce this to 2-3 bytes, but the format gets
uglier/hackier. IMHO, probably not worth that cost.


I just counted those cases. In the history dump from October 2012 there are
2344 nodes without coordinates. Hardly worth thinking about...


That sounds implausibly low.

Given that

1. every deleted node should be in that file without coordinates
2. we're currently at node id 2.03 billion,
3. there are 1.66 billion visible nodes in the database

we should have something like 370 million deleted nodes.

Hm, we probably have to remove from that number those nodes that were 
deleted in ancient times where we've meanwhile dropped the history, and 
maybe some from the first TIGER import where we manually removed them 
from the database, but still - at least every node deleted in the past 
couple of years *should* show up with visible=false in the full history 
dump, and any node with visible=false *should* not have coordinates.


Either there's an error in my thinking, or in your count, or in the 
script that does the history export ;)


Bye
Frederik


--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33



Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Scott Crosby
On Wed, Nov 21, 2012 at 3:46 AM, Jochen Topf joc...@remote.org wrote:

 On Tue, Nov 20, 2012 at 09:17:59PM -0600, Scott Crosby wrote:
  Not quite. The granularity of timestamps can go down to the milliseconds.
 
 
 https://github.com/DennisOSRM/OSM-binary/blob/master/src/osmformat.proto#L96

 Ugh. Yes. That was always somewhat of a problem in the protocol IMHO.
 Nobody
 needs more granularity than seconds because the main database doesn't have
 it.
 Similar for the latitude/longitude granularity. Nobody uses that. And it
 just
 makes all the code reading PBF files a bit more complex and a bit slower.


Today the database lacks those features, but the future can be different.
The trivial complexity of that feature in readers allows many possible
future features, without a breaking format change. The ones I had in mind
were:

- Lower granularity makes it easy to create lower-precision excerpts that
  are smaller to send and easier to store.
- Allow OSM tooling to handle contour lines, or other grid-specified
  data, where making the granularity size matching the grid size can lead to
  vastly improved compression.
- Support future higher-precision data, e.g., generated from GPS block
  3 satellites.
- Millisecond timestamps are much easier to use as unique changeset ID's
  than second-granularity timestamps.

The runtime cost of this is a couple of multiplications that loop-invariant
code motion can remove; about 30 nanoseconds for each 8000 entity block,
and is much much cheaper than the branch prediction failures of VarInt
decoding.

Scott


Re: [OSM-dev] Timestamp in PBF files

2012-11-21 Thread Scott Crosby
On Wed, Nov 21, 2012 at 5:26 AM, Frederik Ramm frede...@remote.org wrote:

 Hi,

 On 11/21/2012 11:50 AM, mar...@gmx.eu wrote:

 OK, this seems to be consensual: PBF id 18 in the header block for a
 signed int UNIX timestamp value.


 In both his messages, Scott had suggested PBF id 18 for a signed int epoch
 value of the file creation, not for a signed int epoch value of the
 replication state.

 It would probably be premature to call this a consensus for a replication
 state timestamp at PBF id 18.


I think for Frederik's immediate needs, we should add a field called
osmosis_replication_timestamp or osmosis_replication_state = 32, which
contains a submessage containing a replication timestamp and other
replication data that he feels is appropriate.

As for the timestamp =18 field, Dennis, what was your intended use of this
field?  Marqqs, what is the intended use of your timestamp
optional_features field?

By this, I mean: what semantics are you attaching to these timestamps? I
think it's perfectly reasonable to have several timestamp fields, perhaps:
   The timestamp the file was generated.
   The state needed to resume replication of an extract/planet (which
contains an internal timestamp)?
   The timestamp of when the file was extracted/excerpted?

If you two could give me a better idea of what your timestamps are used
for, I could advise on how we can try to integrate them into one or more
standard timestamp fields. And after that, we can then figure out how we
might want to assign timestamps to field names/ids --- keeping in mind
prior uses of those field names and numbers.

Thoughts,
Scott


Re: [OSM-dev] Timestamp in PBF files

2012-11-20 Thread Stephan Knauss
Frederik Ramm writes: 

I really don't mind *how* it's done but I would really love to have one 
agreed way to place a timestamp in a PBF instead of everyone rolling their 
own.


I would prefer epoch timestamps. That's a widely accepted way of storing 
time information without the need to worry about time zones and such. 

While we change the header: Could we also include a field to indicate a 
full history planet? After the redaction period the lat/lon is only a 
required field for non-redacted elements.
Is it possible to express this in protobuf? 

If not, it would be fine to at least have a defined value for "undefined" 
that we could document. If I remember correctly Jochen suggested using MAXINT 
for this. 


Stephan



Re: [OSM-dev] Timestamp in PBF files

2012-11-20 Thread Jochen Topf
On Tue, Nov 20, 2012 at 01:51:02PM +0100, Frederik Ramm wrote:
about a year ago, Marqqs tried to have a discussion on how to add
 timestamps to PBF files and hardly anyone was interested.

Before we get into the details of how this timestamp is implemented in the PBF
format, maybe somebody can define what this timestamp is actually timestamping?
Is it the time the file was created? The last changed object in the file? The
time the database extract was created? Something else?

I guess the timestamp is somehow supposed to say which state of the OSM
database this file represents. How is it supposed to work in history files?
Do we need two timestamps to define a range for history files?

Jochen
-- 
Jochen Topf  joc...@remote.org  http://www.remote.org/jochen/  +49-721-388298



Re: [OSM-dev] Timestamp in PBF files

2012-11-20 Thread marqqs
Hello Jochen,

very good question.
From my point of view the file timestamp should be the in-file representation 
of the externally maintained state.txt timestamp, as in
  http://planet.openstreetmap.org/replication/hour/000/001/668.state.txt
for example.

This would make it very easy to update .osm.pbf files on a file basis. You 
would not need to care about externally maintained timestamp files. You would 
just say "update this file" and the update process could be done automatically.
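
To illustrate how an updater could get from a stored sequence number back to
the matching state file, here is a minimal Python sketch. It assumes the
layout suggested by the example URL above: the sequence number is zero-padded
to nine digits and split into three directory levels. The function name
state_url and the HOURLY constant are made up for this example.

```python
def state_url(base_url: str, sequence_number: int) -> str:
    """Build the state.txt URL for a replication sequence number.

    Assumption: nine zero-padded digits, split as AAA/BBB/CCC.
    """
    if not 0 <= sequence_number <= 999_999_999:
        raise ValueError("sequence number must fit in nine digits")
    digits = f"{sequence_number:09d}"  # 1668 -> "000001668"
    path = "/".join((digits[0:3], digits[3:6], digits[6:9]))
    return f"{base_url.rstrip('/')}/{path}.state.txt"


# Base URL taken from the example in this mail.
HOURLY = "http://planet.openstreetmap.org/replication/hour"
print(state_url(HOURLY, 1668))
```

A PBF file that carries its own sequence number or timestamp would give an
updater the input it needs for this lookup without any side file.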

Regards
Markus



Re: [OSM-dev] Timestamp in PBF files

2012-11-20 Thread Frederik Ramm

Hi,

On 20.11.2012 20:12, Jochen Topf wrote:

I guess the timestamp is somehow supposed to say which state of the OSM
database this file represents.


Yes. Basically whatever was in Osmosis' state.txt file at the time this 
file was created.



How is it supposed to work in history files?


I think it would make sense to have a comparable timestamp in history 
files. Currently there's no software that would be able to patch history 
files with freshly downloaded diffs, so the discussion is rather academic 
though.



Do we need two timestamps to define a range for history files?


I'd suggest to wait until someone has an application that needs this. 
I'm a bit wary of throwing this discussion wide open because before too 
long we'll have all sorts of people suggesting helpful optional 
enhancements to the PBF format (while we're at it, can we maybe do X) 
and then nothing gets done again.


All *I* want is one extra timestamp, and I would start using it 
tomorrow, it's not academic, there's software that would process it, 
there's a clear benefit to users. I'd prefer to use a standard but in 
the absence of an existing standard I'll just make something up and use 
that. (But we've seen how well that works - I did make something up for 
testing and Marqqs expected something else.)


Bye
Frederik

--
Frederik Ramm  ##  eMail frede...@remote.org  ##  N49°00'09 E008°23'33



Re: [OSM-dev] Timestamp in PBF files

2012-11-20 Thread Dennis Luxen

Hello,


Yes. Basically whatever was in Osmosis' state.txt file at the time this
file was created.


This is the use-case I had in mind when experimenting with time-stamps 
in PBF. Updating self-contained PBF files through Osmosis is a major 
advantage to using state.txt files. I, for one, plan to support such a 
time-stamp in OSRM from day one (or two).


--Dennis




Re: [OSM-dev] Timestamp in PBF files

2012-11-20 Thread marqqs
Hello,

did you know that the PBF file timestamp has its anniversary these days? :-)

https://wiki.openstreetmap.org/w/index.php?title=Talk:PBF_Format&diff=708490&oldid=705430

After some thought...
would it hurt if there were _two_ file timestamps in a PBF file? One 
string-formatted according to the definition from a year ago (see OSM Wiki on 
PBF), and a second one in UNIX time format.

osmconvert would then write _both_ of these timestamps - for reasons of 
compatibility.

Thus, Frederik's new PBF file timestamp could be processed even from now on.

As soon as the decision has been made, one of the two file timestamp 
procedures could be removed from the code.
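
As a rough Python sketch of writing both forms from one moment in time: this
assumes the string form is an ISO 8601 UTC timestamp like the one in
state.txt (the exact wiki definition is not reproduced in this thread) and
that the UNIX form is whole seconds since the epoch. The function name
both_timestamp_forms is made up for this example.

```python
from datetime import datetime, timezone
from typing import Tuple


def both_timestamp_forms(moment: datetime) -> Tuple[str, int]:
    """Return (ISO 8601 UTC string, UNIX seconds) for a timezone-aware datetime."""
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    utc = moment.astimezone(timezone.utc)
    iso = utc.strftime("%Y-%m-%dT%H:%M:%SZ")  # assumed string format
    epoch = int(utc.timestamp())              # seconds since 1970-01-01 UTC
    return iso, epoch


state_time = datetime(2012, 11, 20, 19, 0, 0, tzinfo=timezone.utc)
iso_form, unix_form = both_timestamp_forms(state_time)
print(iso_form, unix_form)
```

Deriving both values from the same datetime keeps the two header entries
from drifting apart while both are being written.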

Markus

 Original Message 
 Date: Tue, 20 Nov 2012 21:50:26 +0100
 From: Dennis Luxen dennis.lu...@gmail.com
 To: Frederik Ramm frede...@remote.org
 CC: dev@openstreetmap.org dev@openstreetmap.org, Scott Crosby 
 sc...@sacrosby.com
 Subject: Re: [OSM-dev] Timestamp in PBF files

 Hello,
 
  Yes. Basically whatever was in Osmosis' state.txt file at the time this
  file was created.
 
 This is the use-case I had in mind when experimenting with time-stamps 
 in PBF. Updating self-contained PBF files through Osmosis is a major 
 advantage to using state.txt files. I, for one, plan to support such a 
 time-stamp in OSRM from day one (or two).
 
 --Dennis
 
 



Re: [OSM-dev] Timestamp in PBF files

2012-11-20 Thread Stephan Knauss

On 20.11.2012 20:28, mar...@gmx.eu wrote:

Stephan, what did you mean we would need this undefined for?

I should not have mixed topics.

When you process a full-history PBF, nodes which were redacted are stored 
with MAXINT for lat/lon. This is because lat/lon are required fields. 
Jochen mentioned this a few mails back.


A software reading history PBF might want to handle these elements in a 
special way...


Stephan





Re: [OSM-dev] Timestamp in PBF files

2012-11-20 Thread Scott Crosby
On Tue, Nov 20, 2012 at 1:09 PM, Jochen Topf joc...@remote.org wrote:

 On Tue, Nov 20, 2012 at 06:57:50PM +0100, Stephan Knauss wrote:
  Frederik Ramm writes:
 
  I really don't mind *how* it's done but I would really love to
  have one agreed way to place a timestamp in a PBF instead of
  everyone rolling their own.
 
  I would prefer epoch timestamps. That's a widely accepted way of
  storing time information without the need to worry about time zones
  and such.

 The other timestamps in PBF files (at all the objects) use 64 bit integers
 with seconds since epoch. So it would make sense to use the same format.


Not quite. The granularity of those timestamps can go down to milliseconds.

https://github.com/DennisOSRM/OSM-binary/blob/master/src/osmformat.proto#L96
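
A small Python sketch of what that means for a reader. It assumes the linked
field is date_granularity with a default of 1000, as in the published
osmformat.proto, so that a raw object timestamp multiplied by the granularity
gives milliseconds since the epoch. The function name object_time is made up
for this example.

```python
from datetime import datetime, timezone

# Assumed default: one raw unit = 1000 ms, i.e. timestamps in whole seconds.
DEFAULT_DATE_GRANULARITY = 1000


def object_time(raw_timestamp: int,
                date_granularity: int = DEFAULT_DATE_GRANULARITY) -> datetime:
    """Convert a raw PBF object timestamp to a UTC datetime."""
    millis = raw_timestamp * date_granularity
    seconds, ms = divmod(millis, 1000)
    whole = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return whole.replace(microsecond=ms * 1000)


print(object_time(1353438000))        # default granularity: whole seconds
print(object_time(1353438000500, 1))  # granularity 1: raw value is in ms
```

A header timestamp defined as plain seconds would therefore match the common
case, but not every file that a writer is allowed to produce.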


 You can have optional fields in protobuf, but unfortunately this doesn't
 help us in this case. There are two ways nodes can be stored in PBF files:
 as a series of Node objects or as DenseNode objects. Node objects
 have required fields lat and lon. We could change this to be optional.
 There would be a has_lat() or has_lon() call to check for this.

 Unfortunately in most cases the more space efficient DenseNode objects
 are used. In this case the latitude and longitude of all nodes of a block
 are stored in a special delta encoding. This doesn't allow for optional
 fields. As far as I can see we could either add a boolean for each node in
 a block that defines whether the coordinate field is valid or use a special
 value for an invalid coordinate.


Correct. There is no way in the current DenseNodes format to encode 'no
value' for a latitude or longitude. Changing the message to include (say)
a boolean array for hasLatitude()/hasLongitude() would be a breaking
format change, and would add about 18-40 bytes to each block of 8000 nodes.

How many nodes in the planet lack a latitude or longitude? Using a MAXINT
encoding will cost about 8 bytes for each missing latitude or longitude.
It's possible to reduce this to 2-3 bytes, but the format gets
uglier/hackier. IMHO, probably not worth that cost.
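
To get a feel for the sentinel cost, here is a rough Python sketch. It
assumes DenseNodes coordinates are written as delta-coded protobuf sint64
values (zigzag, then varint) and that the sentinel is 2**31 - 1; the thread
only says "MAXINT", so the exact value is an assumption, and the sample
coordinates are made up.

```python
def zigzag(n: int) -> int:
    """Map a signed 64-bit integer to an unsigned one (protobuf sint64)."""
    return ((n << 1) ^ (n >> 63)) & 0xFFFFFFFFFFFFFFFF


def varint_len(u: int) -> int:
    """Number of bytes a protobuf varint needs for the unsigned value u."""
    length = 1
    while u >= 0x80:
        u >>= 7
        length += 1
    return length


def delta_encoded_size(values) -> int:
    """Bytes needed to store values as delta-coded sint64 varints."""
    size, previous = 0, 0
    for value in values:
        size += varint_len(zigzag(value - previous))
        previous = value
    return size


SENTINEL = 2**31 - 1  # assumed "MAXINT" marker for a missing coordinate

# Made-up latitudes of nearby nodes, in the file's integer units.
lats = [490025000, 490025100, 490025250, 490025300]
with_gap = [490025000, 490025100, SENTINEL, 490025250, 490025300]

extra = delta_encoded_size(with_gap) - delta_encoded_size(lats)
print("extra bytes for one missing latitude:", extra)
```

With these sample values the difference comes out at 8 bytes, because both
the jump to the sentinel and the jump back need a long varint while the
ordinary delta between neighbouring nodes fits in one or two bytes.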


Re: [OSM-dev] Timestamp in PBF files

2012-11-20 Thread Scott Crosby
Idea: why not put the entire state.txt file into the OSMHeader block?

/* Contains the file header. */

message HeaderBlock {
  optional HeaderBBox bbox = 1;
  /* Additional tags to aid in parsing this dataset */
  repeated string required_features = 4;
  repeated string optional_features = 5;

  optional string writingprogram = 16;
  optional string source = 17; // From the bbox field.
  optional sint64 timestamp = 18; // Unix-Time encoded into varint.
  optional string osmosis_update_state = 19; // Encoding of the state.txt file.
}


One thing I don't like about it is that the state.txt file is not
self-contained:

#Tue Nov 20 19:02:18 UTC 2012
sequenceNumber=1668
timestamp=2012-11-20T19\:00\:00Z

It should have a planet URI (or a planet URI and a list of mirrors) of
what planet it corresponds to. That way a user merely needs to say
'update planet' and everything else can be automated.
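
For reference, a minimal Python sketch of reading the state.txt shown above.
It assumes the file is a Java-properties-style key=value list in which colons
are escaped as "\:" and lines starting with "#" are comments; it handles only
that subset, not the full properties syntax. The function name parse_state is
made up for this example.

```python
from datetime import datetime, timezone


def parse_state(text: str) -> dict:
    """Parse the key=value lines of a state.txt into a dict of strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        key, _, value = line.partition("=")
        state[key.strip()] = value.strip().replace("\\:", ":")
    return state


EXAMPLE = """#Tue Nov 20 19:02:18 UTC 2012
sequenceNumber=1668
timestamp=2012-11-20T19\\:00\\:00Z
"""

state = parse_state(EXAMPLE)
sequence_number = int(state["sequenceNumber"])
timestamp = datetime.strptime(
    state["timestamp"], "%Y-%m-%dT%H:%M:%SZ"
).replace(tzinfo=timezone.utc)
print(sequence_number, timestamp.isoformat())
```

Storing the raw text in the header would leave this parsing to every reader,
whereas separate typed fields would move it to the writer.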

Scott


On Tue, Nov 20, 2012 at 2:50 PM, Dennis Luxen dennis.lu...@gmail.com wrote:

 Hello,


  Yes. Basically whatever was in Osmosis' state.txt file at the time this
 file was created.


 This is the use-case I had in mind when experimenting with time-stamps in
 PBF. Updating self-contained PBF files through Osmosis is a major advantage
 to using state.txt files. I, for one, plan to support such a time-stamp in
 OSRM from day one (or two).

 --Dennis




Re: [OSM-dev] Timestamp in PBF files

2012-11-20 Thread Scott Crosby
On Tue, Nov 20, 2012 at 6:51 AM, Frederik Ramm frede...@remote.org wrote:

 Hi,

 (message to dev list but explicitly Cc'ing Brett and Scott because I don't
 know if they follow dev)

about a year ago, Marqqs tried to have a discussion on how to add
 timestamps to PBF files and hardly anyone was interested.

 I've had a couple people ask me whether I could somehow add timestamp
 information to the PBF files that I produce for download.geofabrik.de so
 I'd be interested in solving this somehow.

 I really don't mind *how* it's done but I would really love to have one
 agreed way to place a timestamp in a PBF instead of everyone rolling their
 own.

 What's the current status of this discussion? Is there already an approved
 way to deal with this?


No status, but if anyone wants my opinion, when authoring the format, I
expected us to add metadata to planets, and expected it to be put into
OSMHeader as in the OSRM clone you linked to above. I would vote to
deprecate the use of the ISO timestring encoded into the optional_features
array, but continue to write to it to avoid breaking old installs of
Marqqs's tools. I also think that we have more than one notion of
timestamp. How does this sound:

message HeaderBlock {
  optional HeaderBBox bbox = 1;
  /* Additional tags to aid in parsing this dataset */
  repeated string required_features = 4;
  repeated string optional_features = 5;

  optional string writingprogram = 16;
  optional string source = 17; // From the bbox field
  optional sint64 timestamp = 18; // Unix-Time encoded into varint of when the file was generated.
  optional sint64 mirror_timestamp = 64; // Unix-Time timestamp of the last update of the source. (used for mirroring)
}


What about adding other metadata or adding in a nanosecond timestamp while
we're at it?

Scott