[osmosis-dev] --used-node performance and a possible way to improve it

Igor Podolskiy Wed, 11 May 2011 12:30:59 -0700

Hi Osmosis developers,

Recently I found myself filtering large extracts of data with fairly restrictive tag-based filters. Imagine a task like "get all administrative boundaries in the state of Baden-Württemberg, Germany". It was taking a long time, lots of CPU, and lots of I/O. And I asked myself what it was doing all that time, actually.

I had a pipeline like [1] (OSMembrane screenshot). Just the old-fashioned filter for ways and relations with a merge at the end (in the screenshot, top row is ways, bottom row is relations).

Turns out that the main culprit is the --used-node task. As you surely know, it works like this:

1. Store all ways, nodes and relations coming in into a "simple object store".

2. While doing this, record all node references.

3. Replay the simple object store to the output, filtering out unneeded nodes.
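
To make the cost of these three steps concrete, here is a minimal Python sketch of the store-and-replay idea. It is not the Osmosis code: the (kind, id, refs) tuples, the function name and the use of pickle are assumptions made for illustration, and relation members are simplified to plain node ids.

```python
import gzip
import pickle
import tempfile


def used_node_store_and_replay(entities):
    """Yield all ways/relations plus only the nodes they reference.

    `entities` is an iterable of hypothetical (kind, id, refs) tuples,
    where kind is "node", "way" or "relation" (illustrative format only).
    """
    required = set()  # plays the role of the id tracker
    with tempfile.TemporaryFile() as raw:
        # Steps 1 and 2: write every entity to a gzipped store on disk
        # and record the node ids referenced by ways and relations.
        with gzip.GzipFile(fileobj=raw, mode="wb") as store:
            for entity in entities:
                kind, _entity_id, refs = entity
                if kind != "node":
                    required.update(refs)
                pickle.dump(entity, store)

        # Step 3: read the whole store back and drop unneeded nodes.
        raw.seek(0)
        with gzip.GzipFile(fileobj=raw, mode="rb") as store:
            while True:
                try:
                    entity = pickle.load(store)
                except EOFError:
                    break  # end of the store
                kind, entity_id, _refs = entity
                if kind != "node" or entity_id in required:
                    yield entity
```

Note that every incoming entity is compressed, written, read and decompressed once, no matter how few nodes survive the filter.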


Basically, in a workflow like

   read from disk -> filter ways -> get used nodes -> write

you write _everything_ you got from disk back to disk and then read it back again in --used-node. More than that, you can only control which filesystem that second write goes to by setting the java.io.tmpdir property, which isn't really intuitive. And you spend CPU time compressing and decompressing the intermediate store.

So, in actual numbers for boundaries in Baden-Württemberg (128 MB PBF as of today) this workflow boils down to "read 128 MB from disk, write ~180 MB gzipped serialized objects to disk, read ~180 MB from disk, write 2 MB PBF to disk." In the example pipeline shown above, those ~180 MB of gzipped serialized objects get written and read _twice_ because of two --used-node tasks.
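
As a rough tally of the numbers above, the following Python sketch adds up the disk traffic. It assumes the source file is read only once regardless of how many --used-node tasks there are, which the figures above do not state; the function and constant names are made up for this illustration.

```python
SOURCE_MB = 128  # PBF extract read from disk
STORE_MB = 180   # approximate size of one gzipped intermediate store
OUTPUT_MB = 2    # resulting PBF


def io_volume_mb(used_node_tasks):
    """Return (MB read, MB written) for a pipeline with N --used-node tasks.

    Assumption: the source is read once; each task writes its own
    intermediate store and reads it back once.
    """
    read = SOURCE_MB + used_node_tasks * STORE_MB
    written = used_node_tasks * STORE_MB + OUTPUT_MB
    return read, written


for tasks in (1, 2):
    read, written = io_volume_mb(tasks)
    print(f"{tasks} task(s): read {read} MB, write {written} MB")
```

Under that assumption the two-task pipeline moves about 850 MB through the disk to produce 2 MB of output.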

This seemed, well, a little wasteful to me. You should only pay for what you are getting (the 2 MB), not for everything there is (the 128 MB), and surely not twice :) So I thought up an approach which avoids intermediate stores.

It involves a task that takes two input streams and produces a single one. It works like this:

1. Read everything from the first stream and ignore all nodes, record the required ids for ways and relations in an id tracker just like --used-node, and pass the ways and relations downstream immediately.

2. Read everything from the second stream, ignore all ways and relations and only pass the required nodes (based on the id tracker) downstream.


It's a bit like a merge with an id tracker.
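
The two steps above can be sketched in Python as follows. This is only an illustration of the idea, not the actual patch: the (kind, id, refs) tuples and the function name are hypothetical, and relation members are simplified to node ids.

```python
def fast_used_node(first_stream, second_stream):
    """Merge two entity streams, keeping only the nodes that are needed.

    Both streams yield hypothetical (kind, id, refs) tuples, where kind
    is "node", "way" or "relation" (illustrative format only).
    """
    required = set()  # id tracker, filled while reading the first stream

    # Step 1: ignore nodes, remember referenced ids, and pass ways and
    # relations downstream immediately (nothing is stored).
    for kind, entity_id, refs in first_stream:
        if kind == "node":
            continue
        required.update(refs)
        yield (kind, entity_id, refs)

    # Step 2: ignore ways and relations, pass only the required nodes.
    for kind, entity_id, refs in second_stream:
        if kind == "node" and entity_id in required:
            yield (kind, entity_id, refs)
```

In a pipeline, the first stream would be the filtered ways and relations and the second stream a fresh read of the same source file. The first stream must be consumed completely before the second one is touched, and the output comes out as ways/relations first, then nodes.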

In terms of the complete workflow, it involves reading the source file from disk twice; the pipeline equivalent to [1] is shown in [2] (another OSMembrane screenshot).

I implemented this task (named --fast-used-node, better name needed ;)) and made a couple of measurements for the example I mentioned above (admin boundaries in Baden-Württemberg).


Pipeline [1] with --used-node: ~312+-10 seconds
Pipeline [2] with --fast-used-node: ~140+-10 seconds

A simple pipeline like read->filter ways->used-nodes->write takes about 140 seconds with --used-node; the --fast-used-node one takes ~90 seconds in total.
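
For reference, the speedups implied by these timings can be worked out with a few lines of Python. The variable names are made up here; the numbers are the ones measured above, taken at their central values.

```python
# Timings in seconds, copied from the measurements above.
full_used_node, full_fast = 312, 140      # pipelines [1] and [2]
simple_used_node, simple_fast = 140, 90   # simple single-filter pipeline


def speedup(before, after):
    """Ratio of the old runtime to the new runtime."""
    return before / after


full_speedup = speedup(full_used_node, full_fast)
simple_speedup = speedup(simple_used_node, simple_fast)

print(f"pipeline [1] vs [2]: {full_speedup:.2f}x")
print(f"simple pipeline:     {simple_speedup:.2f}x")
```

That is roughly a factor of 2.2 for the full pipeline and about 1.6 for the simple one, ignoring the +-10 second spread.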

All numbers on Pentium Dual-Core E5300 2.6 GHz, Win7 Pro 32-Bit, vanilla SATA disk, default 64 MB heap size (irrelevant for this task). Both approaches seem to be CPU-bound (so the compression/decompression is more of a problem than the I/O in and of itself).

Of course, everything has a price. First, you effectively need to read the source file twice from disk; just splitting up a stream and buffering it isn't enough, as all buffers will eventually fill up and everything will come to a halt. That also assumes that the source stream can be read twice in the first place, so network sources or stdin won't work, at least for now. Also, the pipelines get more complex, and the whole principle is a bit harder to understand than the straightforward one.

And finally, it changes the sorting order to "ways/relations, then nodes" - I don't know if this is a big problem.

Anyway, for my use case, it works[TM], and I figured that this use case - restrictive tag-based filtering of big source file-on-disk datasets - would be quite common.

So what do you think - do we/you want that patch with --fast-used-node (and probably a similar one for --fast-used-way)? :) Or is it too special? Or not worthwhile for some other reason?

I would be really glad to hear your feedback, if you can spare some of your time for it.


In the hope this will help someone with something,
Best regards,
Igor

[1] http://i.imgur.com/beqT6.png
[2] http://i.imgur.com/nV3kL.png

_______________________________________________
osmosis-dev mailing list
[email protected]
http://lists.openstreetmap.org/listinfo/osmosis-dev
