Hi Brett, hi @osmosis-dev,

> You're describing a common problem with the Osmosis pipeline.  Many
> scenarios would be improved if you could access the data stream twice
> rather than having to buffer all data within the task itself.

Good to know I'm not alone ;)

In the meantime I've been working quite a lot with Osmosis, with the standard tasks and with my custom ones, and I've been following the discussions on this list, especially your posts about metadata... so by now I think I have some clue about what actually lies behind these problems and solutions, mine included. In a way, this is also a reply to your recent post about metadata. Feel free to skip my ramblings :)

Here's the TL;DR version: my --fast-used-* is just a workaround and should stay a plugin, out of the main tree. What I think we need is a much more generic solution, like a second type of communication channel between the tasks for flow control information.

Full version:

There seem to be many use cases for Osmosis which more or less blow up the current streaming principle, --used-* being only one of them. That's why we're talking on this list about additional metadata, about the completeWays performance of --bbox, about the performance of --used-node, and about RestartExceptions.

Even the <bound> handling is actually wrong in many places, and it can't be made right because of the stream ordering requirements. I somewhat fixed it in --merge, but it's more or less unfixable in, say, --apply-change without caching the whole stream, which leads to the --used-node problem (lots of wasted CPU, I/O and disk space).
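
To make the ordering problem concrete, here's roughly what a task that wants to emit a *correct* <bound> has to look like under the current model. This is a sketch against the v0.6 task API as I remember it; the class itself is made up, and error handling and the empty-stream case are omitted:

    import java.util.ArrayList;
    import java.util.List;

    import org.openstreetmap.osmosis.core.container.v0_6.BoundContainer;
    import org.openstreetmap.osmosis.core.container.v0_6.EntityContainer;
    import org.openstreetmap.osmosis.core.domain.v0_6.Bound;
    import org.openstreetmap.osmosis.core.domain.v0_6.Node;
    import org.openstreetmap.osmosis.core.task.v0_6.Sink;
    import org.openstreetmap.osmosis.core.task.v0_6.SinkSource;

    /**
     * Sketch of the <bound> dilemma: the bound must precede the first
     * entity in the stream, but it is only known after the last one, so
     * the whole stream ends up cached for the sake of 4 doubles.
     */
    public class ComputeBoundTask implements SinkSource {
        private Sink sink;
        // In memory for brevity; a real task would spill to disk,
        // which is exactly the wasted I/O and disk space I mean.
        private final List<EntityContainer> buffer =
                new ArrayList<EntityContainer>();
        private double minLat = 90, maxLat = -90, minLon = 180, maxLon = -180;

        public void setSink(Sink sink) {
            this.sink = sink;
        }

        public void process(EntityContainer container) {
            if (container.getEntity() instanceof Node) {
                Node node = (Node) container.getEntity();
                minLat = Math.min(minLat, node.getLatitude());
                maxLat = Math.max(maxLat, node.getLatitude());
                minLon = Math.min(minLon, node.getLongitude());
                maxLon = Math.max(maxLon, node.getLongitude());
            }
            // Nothing can be forwarded yet: the bound has to come first.
            buffer.add(container);
        }

        public void complete() {
            // Only now do we know the bound, so only now can we start
            // writing - after the whole stream sat in the buffer.
            sink.process(new BoundContainer(
                    new Bound(maxLon, minLon, maxLat, minLat, "computed")));
            for (EntityContainer container : buffer) {
                sink.process(container);
            }
            sink.complete();
        }

        public void release() {
            sink.release();
        }
    }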

As I used Osmosis over the last couple of weeks, I often found myself thinking: can I do X with Osmosis? Can I optimize step Y of the pipeline? And the answer is almost always: no, it requires replaying the stream. No, it requires a bit of information from another task up or down the pipeline. No, it requires some coordination or synchronization between multiple tasks.

In short, and I think that's the core problem: the answer is no because _the tasks do not know enough about each other_, both before and during processing.

And as tempting as metadata embedded in the data stream (like you implemented in your last commits) or a workaround like my --fast-used-node might be - in my very humble and uninformed opinion, they treat the symptoms, not this core problem.

What we have now is very similar to the plain old analog telephone network: we transmit both data and control information over a single channel. And there's a reason the telecom networks switched to out-of-band signaling, with separate channels for the data and for metadata like call setup: it is more flexible. Before that, you had crazy workarounds like "hook flash", where you transmitted information by actually interrupting the channel with a particular timing. Maybe that's just me, but throwing an exception up the pipeline is in a way very much like a hook flash :)

As to the RestartStreamException in particular, I have a feeling that it isn't really going to work well with --tee, --buffer, --merge, --apply-change and similar tasks. And even if it did, an exception seriously messes up your control flow: once you throw it, you cannot really make any assumptions about what state you're currently in. And even if you work around that, the code is going to be _really_ messy.

So I think what we need is a full-blown "control plane" for communication between tasks. Like in telecommunications, it should be orthogonal to the data streams; otherwise you will always have the problem that the bit of information you need is in the wrong place in the stream. We see this with <bound>: you need to write it out at the start of the stream, but in the current model you only know it at the end. That's why you would need to cache a GB-sized stream, which seems pretty wasteful for 4 doubles.
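
Just to make "orthogonal" a bit more concrete, here's a very rough sketch of the kind of channel I have in mind. The ControlChannel class and everything about it is made up for illustration - plain java.util.concurrent, nothing Osmosis-specific:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.CountDownLatch;

    /**
     * Hypothetical out-of-band channel, shared by all tasks of a
     * pipeline. A task publishes a fact the moment it knows it; another
     * task blocks until the fact is available - independent of where
     * either of them sits in the stream ordering.
     */
    public class ControlChannel {
        private final ConcurrentMap<String, Object> facts =
                new ConcurrentHashMap<String, Object>();
        private final ConcurrentMap<String, CountDownLatch> latches =
                new ConcurrentHashMap<String, CountDownLatch>();

        /** Publish a fact (e.g. the computed bound) and wake all waiters. */
        public void publish(String key, Object value) {
            facts.put(key, value);
            latchFor(key).countDown();
        }

        /** Block until some other task has published the fact. */
        public Object await(String key) throws InterruptedException {
            latchFor(key).await();
            return facts.get(key);
        }

        private CountDownLatch latchFor(String key) {
            CountDownLatch latch = new CountDownLatch(1);
            CountDownLatch existing = latches.putIfAbsent(key, latch);
            return existing != null ? existing : latch;
        }
    }

A writer that needs the bound for its header could then call channel.await("bound") while an upstream task publishes the value as soon as it has it - no GB-sized cache for 4 doubles. (Yes, two tasks awaiting each other's facts will deadlock; more on that in a moment.)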

I'd like a way for a task to know what is downstream and upstream of it in the pipeline. I'd like proper synchronization, like locks and latches and waits between tasks - yes, synchronization is complex and can lead to deadlocks, but we can have deadlocks _now_ too, without the benefits of synchronization. I'd like a way for one task to be able to say what it is going to do to the stream with respect to the sort ordering or the bounding box. And I'd like other tasks to be able to reason about that in the context of the current pipeline.
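
Purely hypothetical again, but such a declaration could be as small as an interface that every task implements and that the pipeline inspects before any data flows - all the names here are made up:

    /**
     * Hypothetical per-task contract: what a task requires from its
     * input and what it guarantees about its output, stated up front so
     * that the pipeline (or other tasks) can reason about the chain.
     */
    public interface TaskContract {

        /** Does this task require its input sorted by type-then-id? */
        boolean requiresSortedInput();

        /** What does this task do to the sort order of the stream? */
        SortEffect getSortEffect();

        /** What does this task do to the bounding box of the stream? */
        BoundEffect getBoundEffect();

        enum SortEffect { PRESERVES, ESTABLISHES, DESTROYS }

        enum BoundEffect { UNCHANGED, SHRINKS, UNKNOWN }
    }

With something like that in place, the pipeline could walk the whole chain at startup and, say, refuse to run a task that needs sorted input behind one that destroys the ordering, or insert a --sort automatically - instead of failing somewhere in the middle of a multi-hour run.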

Enough I'd likes. :) If you have a good "control plane", good things happen, that's it, actually ;) I very well understand that this is going to make the task API more complex. But I think it would pay off, as you could use Osmosis for more tasks. Passing metadata in the data stream, and exceptions, and --fast-used-node make things more complex as well, and the payoff is smaller IMHO. You're right when you say that implementing a not-good-enough solution is worse than not implementing one at all - so the question for me really is: what is "good enough"?

I have some - very basic - thoughts about how that "control plane" could work. I think a Wiki page with a more thorough description of the approach would be a better way to communicate them. I'll try to do that. Or would you rather have it here on the list?

As to your hesitation to accept the --fast-used-* workaround: you're absolutely right, and I understand that now. There are issues with those tasks, and I could and maybe will address some of them - but it will still be a workaround. By now even I don't think it should go into the main distribution.

I've packaged --fast-used-* as a plugin and I'm going to make it public somewhere for those who need it now, like myself. What would be a good place to do that, BTW?

Thank you for taking the time to read all of this,
Greetings from Stuttgart, Germany,
Igor
