On Fri, Feb 18, 2011 at 2:57 AM, Frederik Ramm <frede...@remote.org> wrote:
> Hi,
>
>   in the long run I'd like to change the Geofabrik extracts so that they
> have the completeWays/completeRelations feature enabled. It's a pain because
> that totally breaks the elegant and well-performing streaming mode in
> Osmosis but it would really make the extracts more usable, and more in line
> with what people get from the API.
>
> My biggest concern is the disk space used for temporary storage. If I read
> things correctly, a temporary storage of the input stream is created for
> each --bb or --bp task. So if you do something like
>
> osmosis --rb planet --tee 5
>  --bb ... --wb europe
>  --bb ... --wb asia
>  --bb ... --wb america
>  --bb ... --wb australia
>  --bb ... --wb africa
>
> then you will temporarily have 5 copies of the planet file lying around. So
> while, if there was only one copy of it, I could still hope to make use of
> linux file system buffers and a lot of RAM to soften the negative impact of
> file storage, that will kill performance for sure.
>


> I wonder if there is a way to at least reduce this to *one* temporary
> storage. The easiest thing I could imagine would be a new "multi-bb" (or
> "multi-bp") task that basically combines the tee and bb. That would be less
> elegant and would probably also be less efficient because it would not use
> multiple threads, but it could easily use one shared temporary storage.

>
> But I've been thinking: With the high performance of PBF reading, a
> two-pass operation should become possible. Simply read the input file twice,
> determining which objects to copy in pass 1, and actually copying them in
> pass 2. I'm just not sure how that could be made to fit in Osmosis. One way
> could be creating a special type of file, a "selection list", from a given
> entity stream. A new task "--write-selection-list" would dump the IDs of all
> nodes, ways, and relations that were either present or referenced in the
> entity stream:

I planned out how to do this, but never got around to implementing it,
because implementing it within the osmosis piping framework
was...unclear.

I used two bitsets for every output file: one indicating which nodes
were already output, and another (built when processing ways)
indicating which node IDs were missed and will need to be grabbed on
the next pass. There are another two pairs of bitsets for ways and
relations. That's around 400 MB of RAM per output.

In the first pass, assign nodes to one or more regions by geographic
location. To remember this mapping when assigning ways and relations, use
a sparse multi-hashmap, built by layering several sparsehashmaps
(http://code.google.com/p/google-sparsehash/). As the keys are really
densely packed integers, don't store them; use a sparse hash array.
(Actually, use a hybrid approach; see the mkgmap splitter.)
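The idea of the pass-1 node-to-region mapping could look like this (a
sketch under my own assumptions: a plain HashMap stands in for the
sparsehash-style structures named above, purely to show the bitmask idea;
it would be far too memory-hungry for a real planet run):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative node -> region-membership map. Each bit of the value marks
// membership in one output region (up to 32 regions per int; widen to
// long for more). The real design would avoid storing keys at all, since
// node IDs are densely packed.
class NodeRegionMap {
    private final Map<Long, Integer> regionMask = new HashMap<>();

    void assign(long nodeId, int regionIndex) {
        regionMask.merge(nodeId, 1 << regionIndex, (a, b) -> a | b);
    }

    boolean inRegion(long nodeId, int regionIndex) {
        return (regionMask.getOrDefault(nodeId, 0) & (1 << regionIndex)) != 0;
    }
}
```

With the mapping in hand, pass 2 can route each way or relation to every
region that claimed at least one of its member nodes.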

Really, almost everything you want is already done in the mkgmap
splitter crosby_integration branch. Just make it support
non-rectangular regions, and track output/missed entities with
bitsets. I'd say about 4 hours and 2 GB + 400 MB/region to generate as
many outputs as you want.

One caveat is that each file is no longer sorted by
(entityType, entityId), and would need to be re-processed to produce
sorted output.

My mechanism has two lists for each output. That's roughly what I
planned, but one trick that can significantly reduce memory usage is
to instead track a "missing list": which nodes does this extract need
that aren't yet dumped?
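The "missing list" falls out of the bitsets directly (a hedged sketch with
my own naming; the point is just that the set difference is one bulk
operation on java.util.BitSet):

```java
import java.util.BitSet;

// Illustrative derivation of the missing list for one output:
// everything referenced by the ways we kept, minus everything
// already written, is exactly what must be grabbed on the next pass.
class MissingList {
    static BitSet missingNodes(BitSet referencedByWays, BitSet alreadyWritten) {
        BitSet missing = (BitSet) referencedByWays.clone(); // don't mutate input
        missing.andNot(alreadyWritten);                     // keep only IDs not yet dumped
        return missing;
    }
}
```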

> Also,
> what I have sketched above would be able to give you
>
> * all nodes in the bounding box
> * all ways using any of these nodes
> * all nodes used by any of these ways even if outside
> * all relations using any of these nodes or ways
> o all nodes and ways used by any of these relations even if outside
> o but NOT all nodes used by a way drawn in through a relation.
>
> (The points marked "*" are what the API does; the API does not do the "o"
> marked points even though users could be interested in them.)

I did not know that it was allowed to miss the things in "o". That
makes the job easier.

>
> Does anybody have any thoughts about this; maybe a different approach still?

There's a second approach that could split the entire planet into
extracts, simultaneously, in 2-3 minutes: a new planet format that I'm
working on that is geographically sorted. Progress is ongoing, but
slow.

Scott

_______________________________________________
osmosis-dev mailing list
osmosis-dev@openstreetmap.org
http://lists.openstreetmap.org/listinfo/osmosis-dev
