Hi Brett, hi @osmosis-dev,

> You're describing a common problem with the Osmosis pipeline.  Many
> scenarios would be improved if you could access the data stream twice
> rather than having to buffer all data within the task itself.

Good to know I'm not alone ;)

In the meantime I've been working quite a lot with Osmosis, with the standard tasks and with my custom ones, and I've been following the discussions on this list, especially your posts about metadata... so by now I think I have some clue about what actually lies behind these problems and solutions, mine included. In a way, this is also a reply to your recent post about metadata. Feel free to skip my ramblings :)

Here's the TL;DR version: my --fast-used-* is just a workaround and should stay a plugin, out of the main tree. What I think we need is a much more generic solution, like a second type of communication channel between the tasks for flow control information.

Full version:

There seem to be many use cases for Osmosis which more or less blow up the current streaming principle, --used-* being only one of them. That's why we're talking on this list about additional metadata, about the completeWays performance of --bbox, about the performance of --used-node, and about RestartExceptions.

Even the <bound> handling is actually wrong in many places, and it can't be made right because of the stream ordering requirements. I somewhat fixed it in --merge, but it's more or less unfixable in, say, --apply-change without caching the whole stream, which leads to the --used-node problem (lots of wasted CPU, I/O and disk space).
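
To make the ordering problem concrete, here's roughly what a task that wants to emit a *correct* <bound> has to look like under the current model. This is a sketch against the v0.6 task API as I remember it; the class itself is made up, and error handling and the empty-stream case are omitted:

    import java.util.ArrayList;
    import java.util.List;

    import org.openstreetmap.osmosis.core.container.v0_6.BoundContainer;
    import org.openstreetmap.osmosis.core.container.v0_6.EntityContainer;
    import org.openstreetmap.osmosis.core.domain.v0_6.Bound;
    import org.openstreetmap.osmosis.core.domain.v0_6.Node;
    import org.openstreetmap.osmosis.core.task.v0_6.Sink;
    import org.openstreetmap.osmosis.core.task.v0_6.SinkSource;

    /**
     * Sketch of the <bound> dilemma: the bound must precede the first
     * entity in the stream, but it is only known after the last one, so
     * the whole stream ends up cached for the sake of 4 doubles.
     */
    public class ComputeBoundTask implements SinkSource {
        private Sink sink;
        // In memory for brevity; a real task would spill to disk,
        // which is exactly the wasted I/O and disk space I mean.
        private final List<EntityContainer> buffer =
                new ArrayList<EntityContainer>();
        private double minLat = 90, maxLat = -90, minLon = 180, maxLon = -180;

        public void setSink(Sink sink) {
            this.sink = sink;
        }

        public void process(EntityContainer container) {
            if (container.getEntity() instanceof Node) {
                Node node = (Node) container.getEntity();
                minLat = Math.min(minLat, node.getLatitude());
                maxLat = Math.max(maxLat, node.getLatitude());
                minLon = Math.min(minLon, node.getLongitude());
                maxLon = Math.max(maxLon, node.getLongitude());
            }
            // Nothing can be forwarded yet: the bound has to come first.
            buffer.add(container);
        }

        public void complete() {
            // Only now do we know the bound, so only now can we start
            // writing - after the whole stream sat in the buffer.
            sink.process(new BoundContainer(
                    new Bound(maxLon, minLon, maxLat, minLat, "computed")));
            for (EntityContainer container : buffer) {
                sink.process(container);
            }
            sink.complete();
        }

        public void release() {
            sink.release();
        }
    }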

As I used Osmosis over the last couple of weeks, I often found myself thinking: can I do X with Osmosis? Can I optimize step Y of the pipeline? And the answer is almost always: no, it requires replaying the stream. No, it requires a bit of information from another task up or down the pipeline. No, it requires some coordination or synchronization between multiple tasks.

In short, and I think that's the core problem: the answer is no because _the tasks do not know enough about each other_, both before and during processing.

And as tempting as metadata embedded in the data stream (like you implemented in your last commits) or a workaround like my --fast-used-node might be - in my very humble and uninformed opinion, they treat the symptoms, not this core problem.

What we have now is very similar to the plain old analog telephone network: we transmit both data and control information over a single channel. And there's a reason the telecom networks switched to out-of-band signaling, with separate channels for the data and for metadata like call setup: it is more flexible. Before that, you had crazy workarounds like "hook flash", where you transmitted information by actually interrupting the channel with a particular timing. Maybe that's just me, but throwing an exception up the pipeline is in a way very much like a hook flash :)

As to the RestartStreamException in particular, I have a feeling that it isn't really going to work well with --tee, --buffer, --merge, --apply-change and similar tasks. And even if it did, an exception seriously messes up your control flow: once you throw it, you cannot really make any assumptions about what state you're currently in. And even if you work around that, the code is going to be _really_ messy.

So I think what we need is a full-blown "control plane" for communication between tasks. Like in telecommunications, it should be orthogonal to the data streams; otherwise you will always have the problem that the bit of information you need is in the wrong place in the stream. We see this with <bound>: you need to write it out at the start of the stream, but in the current model you only know it at the end. That's why you would need to cache a GB-sized stream, which seems pretty wasteful for 4 doubles.
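
Just to make "orthogonal" a bit more concrete, here's a very rough sketch of the kind of channel I have in mind. The ControlChannel class and everything about it is made up for illustration - plain java.util.concurrent, nothing Osmosis-specific:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.CountDownLatch;

    /**
     * Hypothetical out-of-band channel, shared by all tasks of a
     * pipeline. A task publishes a fact the moment it knows it; another
     * task blocks until the fact is available - independent of where
     * either of them sits in the stream ordering.
     */
    public class ControlChannel {
        private final ConcurrentMap<String, Object> facts =
                new ConcurrentHashMap<String, Object>();
        private final ConcurrentMap<String, CountDownLatch> latches =
                new ConcurrentHashMap<String, CountDownLatch>();

        /** Publish a fact (e.g. the computed bound) and wake all waiters. */
        public void publish(String key, Object value) {
            facts.put(key, value);
            latchFor(key).countDown();
        }

        /** Block until some other task has published the fact. */
        public Object await(String key) throws InterruptedException {
            latchFor(key).await();
            return facts.get(key);
        }

        private CountDownLatch latchFor(String key) {
            CountDownLatch latch = new CountDownLatch(1);
            CountDownLatch existing = latches.putIfAbsent(key, latch);
            return existing != null ? existing : latch;
        }
    }

A writer that needs the bound for its header could then call channel.await("bound") while an upstream task publishes the value as soon as it has it - no GB-sized cache for 4 doubles. (Yes, two tasks awaiting each other's facts will deadlock; more on that in a moment.)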

I'd like a way for a task to know what is downstream and upstream of it in the pipeline. I'd like proper synchronization, like locks and latches and waits between tasks - yes, synchronization is complex and can lead to deadlocks, but we can have deadlocks _now_ too, without the benefits of synchronization. I'd like a way for one task to be able to say what it is going to do to the stream with respect to the sort ordering or the bounding box. And I'd like other tasks to be able to reason about that in the context of the current pipeline.
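
Purely hypothetical again, but such a declaration could be as small as an interface that every task implements and that the pipeline inspects before any data flows - all the names here are made up:

    /**
     * Hypothetical per-task contract: what a task requires from its
     * input and what it guarantees about its output, stated up front so
     * that the pipeline (or other tasks) can reason about the chain.
     */
    public interface TaskContract {

        /** Does this task require its input sorted by type-then-id? */
        boolean requiresSortedInput();

        /** What does this task do to the sort order of the stream? */
        SortEffect getSortEffect();

        /** What does this task do to the bounding box of the stream? */
        BoundEffect getBoundEffect();

        enum SortEffect { PRESERVES, ESTABLISHES, DESTROYS }

        enum BoundEffect { UNCHANGED, SHRINKS, UNKNOWN }
    }

With something like that in place, the pipeline could walk the whole chain at startup and, say, refuse to run a task that needs sorted input behind one that destroys the ordering, or insert a --sort automatically - instead of failing somewhere in the middle of a multi-hour run.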

Enough I'd likes. :) If you have a good "control plane", good things happen, that's it, actually ;) I very well understand that this is going to make the task API more complex. But I think it would pay off, as you could use Osmosis for more tasks. Passing metadata in the data stream, and exceptions, and --fast-used-node make things more complex as well, and the payoff is smaller IMHO. You're right when you say that implementing a not-good-enough solution is worse than not implementing one at all - so the question for me really is: what is "good enough"?

I have some - very basic - thoughts about how that "control plane" could work. I think a Wiki page with a more thorough description of the approach would be a better way to communicate them. I'll try to do that. Or would you rather have it here on the list?

As to your hesitation to accept the --fast-used-* workaround: you're absolutely right, and I understand that now. There are issues with those tasks, and I could and maybe will address some of them - but it will still be a workaround. By now even I don't think it should go into the main distribution.

I've packaged --fast-used-* as a plugin and I'm going to make it public somewhere for those who need it now, like myself. What would be a good place to do that, BTW?

Thank you for taking the time to read all of this,
Greetings from Stuttgart, Germany,
Igor
