> On 07 Feb 2016, at 20:10, Дмитрий Киселев <[email protected]> wrote:
> 
> As for fixed sized blocks in vex, I did consider that option but couldn’t 
> come up with a compelling reason for it. I can see the case for a maximum 
> block size (so we know what the maximum size of allocation will be), but can 
> you give a concrete example of how fixed-size blocks would be advantageous in 
> practice? I would be very hesitant to split any entity across multiple blocks.
> 
> 
> When you need relations-ways-nodes read order, blocks will save you from 
> unnecessarily reading through the whole file (yes, you can skip decompression 
> for nodes/ways, but you must still read the whole file).

Let me rephrase the question: You specifically mentioned blocks of a 
predetermined, fixed size. How would having fixed-size blocks (as opposed to 
the current variable-sized blocks) improve your ability to seek to different 
entity types within a file? Maybe you are thinking of doing a binary search 
through the file rather than a linear search for the blocks of interest. But 
that means the vex blocks would need to be a fixed size after compression, not 
before compression. It seems too complex to require writers to target an exact 
post-compression block size.

Also, I think your observation that “you must read the whole file” when seeking 
ahead to another entity type requires some additional nuance. You need only 
read the header of each block, at which point you know how long that block is 
and can seek ahead to the next block. So indeed, you’d touch at least one 
page or disk block per vex block. Pages are typically 4 kbytes, so if your vex 
blocks are a few Mbytes in size, you would only access on the order of 1/1000 
of the pages while seeking ahead to a particular entity type. 
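To make that skip-ahead concrete, here is a minimal sketch in Python. It assumes a hypothetical header layout (a one-byte entity type followed by a four-byte little-endian compressed payload length); the actual vex header will differ, but the access pattern is the same:

```python
import io
import struct

def seek_to_entity_type(f, wanted_type):
    """Skip ahead block by block, reading only each block header.

    Hypothetical header layout (not the real vex spec): a 1-byte
    entity type followed by a 4-byte little-endian payload length.
    """
    while True:
        header = f.read(5)
        if len(header) < 5:
            return False  # end of file, entity type not found
        etype, length = struct.unpack("<BI", header)
        if etype == wanted_type:
            f.seek(-5, io.SEEK_CUR)  # rewind so the caller sees the header
            return True
        f.seek(length, io.SEEK_CUR)  # skip the payload without reading it
```

Only the five header bytes of each skipped block are ever read; the payload is passed over with a seek, which is what keeps the page-touch count down to roughly one page per block.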

To me, it seems much more advantageous to provide a table of file offsets 
stating where each entity type begins. I have already considered adding this to 
vex after the basic format is established (like author metadata and map 
layers). It seems appropriate to place such a table at the end of the vex data, 
because this allows the writer to produce output as a stream (no backward 
seeks) and a reader can only make effective use of this table if it’s buffering 
the data and able to seek within the file.
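A sketch of such a trailing table, under an assumed (not yet specified) layout: one fixed-width offset per entity type, followed by the offset of the table itself, so the writer never seeks backward and the reader finds the table from the last eight bytes of the file:

```python
import io
import struct

ENTITY_TYPES = ("node", "way", "relation")  # illustrative ordering

def append_offset_table(f, offsets):
    """Append one 8-byte little-endian offset per entity type, then
    the 8-byte offset of the table itself. Writing is purely
    sequential, so the producer can stream its output."""
    table_start = f.tell()
    for name in ENTITY_TYPES:
        f.write(struct.pack("<Q", offsets[name]))
    f.write(struct.pack("<Q", table_start))

def read_offset_table(f):
    """Seek to the last 8 bytes, locate the table, and read it back."""
    f.seek(-8, io.SEEK_END)
    (table_start,) = struct.unpack("<Q", f.read(8))
    f.seek(table_start)
    return {name: struct.unpack("<Q", f.read(8))[0] for name in ENTITY_TYPES}
```

A streaming reader simply never consults the table; a seekable reader pays two extra seeks to jump straight to the entity type it wants.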

> Second example: find something by id. If you have blocks, it's easy to map a 
> whole block into memory and do a binary search for logN block reads instead 
> of seeking through the file all the time.

Unlike o5m, I have not included any requirement that IDs be in a particular 
order, which means binary searches are not always possible. I see vex as a data 
interchange format usable in both streaming and disk-backed contexts, not as a 
replacement for an indexed database table. It’s an interesting idea to try to 
serve both purposes at once and be able to quickly seek to an ID within a flat 
data file, but I’m not sure if such capabilities are worth the extra 
complexity. Such a binary search, especially if performed repeatedly for 
different entities, would be touching (and decompressing) a lot of disk blocks 
/ memory pages because the IDs you’re searching through are mixed in with the 
rest of the data rather than in a separate index as they would be in a database.
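A rough back-of-the-envelope calculation, using the 4 kB page and few-MB block figures above (the 50 GB file size is an assumed example), suggests that even one binary search can touch as many pages as linearly skipping every block header:

```python
import math

# Figures from this thread; the file size is an assumption for
# illustration only.
file_bytes = 50 * 2**30   # assumed ~50 GB file
block_bytes = 4 * 2**20   # vex blocks of ~4 MB
page_bytes = 4 * 2**10    # typical 4 kB page

n_blocks = file_bytes // block_bytes           # 12800 blocks
probes = math.ceil(math.log2(n_blocks))        # ~14 binary-search probes

# Skipping ahead linearly touches about one page per block header:
pages_linear = n_blocks                        # 12800 pages
# Each binary-search probe decompresses a whole block:
pages_binary = probes * (block_bytes // page_bytes)  # 14336 pages
```

Because the IDs are interleaved with the data, each probe pays for a full block decompression, which is why repeated lookups would favor a separate index rather than searching the flat file.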

Andrew

_______________________________________________
dev mailing list
[email protected]
https://lists.openstreetmap.org/listinfo/dev
