Hey Jochen,

That's useful, but it still doesn't let you optimize threading, because you don't know in advance how many blocks there are. Or is there a way to know or estimate this based on the '~8k' block size (from the wiki)?
If osmosis is the reference implementation, is there a reason why it doesn't seem to leverage this block structure to speed up reading? Or does it?

Martijn

Martijn van Exel
skype: mvexel

On Tue, Apr 28, 2015 at 11:22 PM, Jochen Topf <[email protected]> wrote:
> On Di, Apr 28, 2015 at 06:56:23 -0600, Martijn van Exel wrote:
>> Not sure if this has been discussed recently, but we've been thinking
>> about improving osmosis PBF reading performance over at Telenav. My
>> colleague Jon (cc) has come up with a suggestion that I want to put
>> forward for discussion. I'm posting this to both osmosis-dev as well
>> as dev because it affects the PBF format definition.
>>
>> When reading a large PBF resource from a random access file (as
>> opposed to a stream), it might be possible to significantly increase
>> throughput by reading data of the same entity type from multiple
>> threads simultaneously, making use of an optional directory structure
>> to locate valid blocks of nodes, ways and relations for threads to
>> consume.
>>
>> To support parallel access, an optional directory_offset might be
>> added to the HeaderBlock:
>>
>> message HeaderBlock {
>>   …
>>   optional int64 directory_offset
>> }
>>
>> The directory_offset field would be the seek location in the file of a
>> Directory message which is written at the end of the file (since the
>> directory is flexible in length and all offsets are only known after
>> writing all data to the PBF file). The directory itself is simply a
>> list of valid read offsets for each entity type. Threads can read data
>> from a given offset in the list to the next offset. The best chunk
>> size for blocks in the directory can be determined through
>> experimentation, although something around 1MB might be a good first
>> guess.
>>
>> message Directory {
>>   repeated int64 node_block_offsets;
>>   repeated int64 way_block_offsets;
>>   repeated int64 relation_block_offsets;
>> }
>>
>> Before we explore this further, I'd like to know if this has been
>> attempted before, and what concerns there may be.
>
> PBF files already come in blocks with a length header in front of every
> block. Osmium reads this length header in one thread and then puts the
> data of each block into a work queue to be parsed by as many threads as
> you want. This way you already get a nice speedup without any changes to
> the file format.
>
> Jochen
> --
> Jochen Topf [email protected] http://www.jochentopf.com/ +49-351-31778688

_______________________________________________
dev mailing list
[email protected]
https://lists.openstreetmap.org/listinfo/dev
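To make the length-header approach from the quoted reply concrete, here is a rough sketch of a single-threaded scan that collects the offset and size of every block without decompressing anything. It is not Osmium's or osmosis's actual code. It assumes the layout described on the PBF wiki page: a 4-byte big-endian length, then a BlobHeader message (type = field 1, indexdata = field 2, datasize = field 3), then datasize bytes of blob. The names read_varint, parse_blob_header and index_blobs are made up for illustration.

```python
import io
import struct


def read_varint(buf, pos):
    """Decode one protobuf varint from buf at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse_blob_header(buf):
    """Extract (type, datasize) from a serialized BlobHeader message."""
    blob_type, datasize, pos = None, None, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field, wire_type = key >> 3, key & 0x07
        if wire_type == 0:                    # varint field
            value, pos = read_varint(buf, pos)
            if field == 3:                    # datasize (assumed field number)
                datasize = value
        elif wire_type == 2:                  # length-delimited field
            length, pos = read_varint(buf, pos)
            if field == 1:                    # type, e.g. "OSMHeader" / "OSMData"
                blob_type = buf[pos:pos + length].decode("utf-8")
            pos += length                     # also skips indexdata (field 2)
        else:
            raise ValueError(f"unexpected wire type {wire_type}")
    return blob_type, datasize


def index_blobs(f):
    """Return [(blob_offset, blob_type, datasize), ...] for a seekable file.

    Only the small headers are read; blob payloads are skipped with seek().
    """
    index = []
    while True:
        prefix = f.read(4)
        if len(prefix) < 4:
            break                             # end of file
        (header_len,) = struct.unpack(">i", prefix)   # big-endian int32
        blob_type, datasize = parse_blob_header(f.read(header_len))
        if datasize is None:
            raise ValueError("BlobHeader without datasize")
        index.append((f.tell(), blob_type, datasize))
        f.seek(datasize, io.SEEK_CUR)         # jump over the payload
    return index
```

With an index like this, len(index) is the block count, and each (offset, datasize) pair can be handed to a worker thread that seeks, reads and parses its own blob. What the scan cannot tell you is whether a given OSMData block holds nodes, ways or relations; that would need something like the proposed directory, or the optional indexdata field.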

