Hi Osmosis developers,

recently I found myself filtering large extracts of data with fairly restrictive tag-based filters. Imagine a task like "get all administrative boundaries in the state of Baden-Württemberg, Germany". It was taking a long time, lots of CPU, and lots of I/O. And I asked myself what it was doing all that time, actually.

I had a pipeline like [1] (OSMembrane screenshot). Just the old-fashioned filter for ways and relations with a merge at the end (in the screenshot, top row is ways, bottom row is relations).

Turns out that the main culprit is the --used-node task. As you surely know, it works like this:

1. Store all ways, nodes and relations coming in into a "simple object store".
2. During this, records all node references.
3. Replay the simple object store to the output, filtering out unneeded nodes.

Basically, in a workflow like

   read from disk -> filter ways -> get used nodes -> write

you basically write _everything_ you got from disk back to disk and then read it back again in --used-nodes. More than that, you only can control the filesystem the second write happens by setting the java.io.tmpdir property which isn't really intuitive. And you spend CPU time compressing and decompressing the intermediate store.

So, in actual numbers for boundaries in Baden-Württemberg (128 MB PBF as of today) this workflow boils down to "read 128 MB from disk, write ~180 MB gzipped serialized objects to disk, read ~180 MB from disk, write 2 MB PBF to disk." In the example pipeline shown above, those ~180 MB of gzipped serialized objects get written and read _twice_ because of two --used-node tasks.

This seemed, well, a little wasteful to me. You should only pay for what you getting (the 2 MB), not for everything there is (the 128 MB), and surely not twice :) So I thought up an approach which avoids intermediate stores.

It involves a task that takes two input streams and produces a single one. It works like this:

1. Read everything from the first stream and ignore all nodes, record the required ids for ways and relations in an id tracker just like --used-node, and pass the ways and relations downstream immediately.

2. Read everything from the second stream, ignore all ways and relations and only pass the required nodes (based on the id tracker) downstream.

It's a bit like a merge with an id tracker.

In terms of the complete workflow, it involves reading the source file from disk twice; the pipeline equivalent to [1] is shown in [2] (another OSMembrane screenshot).

I implemented this task (named --fast-used-node, better name needed ;)) and made a couple of measurements for the example I mentioned above (admin boundaries in Baden-Württemberg).

Pipeline [1] with --used-node: ~312+-10 seconds
Pipeline [2] with --fast-used-node: ~140+-10 seconds

A simple pipeline like read->filter ways->used-nodes->write takes about 140 seconds with --used-node, the --fast-used-node one takes ~90 seconds complete.

All numbers on Pentium Dual-Core E5300 2.6 GHz, Win7 Pro 32-Bit, vanilla SATA disk, default 64MB heap size (irrelevant for this task). Both approaches seem to be CPU-bound (so the compression/decompression is more a problem than the IO in and of itself).

Of course, everything has a price. First, you effectively need to read the source file twice from disk; just splitting up a stream and buffering it isn't enough, as all buffers will eventually fill up and everything will come to a halt. That assumes that the source stream can be read twice in the first place, so network sources or stdin won't work, at least for now. Also, the pipelines get more complex, and the whole principle is a bit harder to understand than the straightforward one.

And finally, it changes the sorting order to "ways/relations, then nodes" - don't know if this is a big problem.

Anyway, for my use case, it works[TM], and I figured that this use case - restrictive tag-based filtering of big source file-on-disk datasets - would be quite common.

So what do you think - do we/you want that patch with --fast-used-node (and probably a similar one for --fast-used-way)? :) Or is it too special? Or not worthwhile for some other reason?

I would be really glad to hear your feedback, if you can spare some of your time for it.

In the hope this will help someone with something,
Best regards,
Igor

[1] http://i.imgur.com/beqT6.png
[2] http://i.imgur.com/nV3kL.png

_______________________________________________
osmosis-dev mailing list
[email protected]
http://lists.openstreetmap.org/listinfo/osmosis-dev

Reply via email to