Hi Igor,

On Sat, Jun 4, 2011 at 6:04 AM, Igor Podolskiy <
igor.podols...@vwi-stuttgart.de> wrote:

> Hi Brett, hi @osmosis-dev,
>> You're describing a common problem with the Osmosis pipeline.  Many
>> scenarios would be improved if you could access the data stream twice
>> rather than having to buffer all data within the task itself.
> good to know I'm not alone ;)
> In the meantime I've been working quite a lot with Osmosis, with the
> standard tasks, and with my custom ones, and I've been following the
> discussions on this list, especially your posts about metadata... so now I
> think I have some clue about what actually lies behind these problems and
> solutions, mine included. In a way, this is also a reply to your recent post
> about metadata. Feel free to skip my ramblings :)
> Here's the TL;DR version: my --fast-used-* is really just a workaround and
> should really stay a plugin and out of the main tree. What I think we need
> is a much more generic solution, like a second type of communication channel
> between the tasks for flow control information.
> Full version:
> There seem to be many use cases for Osmosis which more or less blow up the
> current streaming principle, --used-* being only one of them. That's why
> we're talking on this list about additional metadata, and completeWays
> performance of --bbox, and performance of --used-node, and
> RestartExceptions.
> Even the <bound> handling is actually wrong in many places, and it can't be
> made right because of the stream ordering requirements. I somewhat fixed it
> in --merge, but it's more or less unfixable in, say, --apply-change without
> caching the whole stream which leads to the --used-node problem (lots of
> wasted CPU, I/O and disk space).
> As I used Osmosis over the last couple of weeks, I often found myself
> thinking: can I do X with Osmosis? Can I optimize step Y of the pipeline?
> And the answer is almost always: no, it requires replaying the stream. No,
> it requires a bit of information from another task up or down the pipeline.
> No, it requires some coordination or synchronization between multiple tasks.
> In short, and I think that's the core problem: the answer is no because
> _the tasks do not know enough about each other_ both before and during
> processing.
> And as tempting as some metadata embedded in the data stream like you
> implemented in your last commits or a workaround implemented with my
> --fast-used-node might be - in my very humble and uninformed opinion it
> handles symptoms, not this core problem.
> What we have now is very similar to the plain old analog telephone
> network: we transmit both data and control information over a single
> channel. And there's a reason the telecom networks switched to out-of-band
> signaling with separate channels for data and metadata like call setup: it
> is more flexible. Before that, you had crazy workaround stuff like "hook
> flash" where you submitted information by actually interrupting the channel
> with a particular timing. Maybe that's just me, but throwing an exception up
> the pipeline is in a way very much like a hook flash :)
> As to the RestartStreamException in particular, I have a feeling that it
> isn't really going to work well with --tee, --buffer, --merge,
> --apply-change and similar tasks. And even if it does, an exception really
> seriously messes up your control flow. Once you throw it, you cannot really
> make any assumptions as to what state you're currently in. And even if you
> work around it, that code is going to be _really_ messy.
> So I think what we need is a full-blown "control plane" for communication
> between tasks. Like in telecommunications, it should be orthogonal to the
> data streams, otherwise you will always be having the problem that the bit
> of information that you need is in the wrong place of the stream. We see
> this with <bound>: you need to write it out at the start of the stream but
> you can only know it at the end in the current model. That's why you would
> need to cache a GB-sized stream which seems pretty wasteful for 4 doubles.
> I'd like a way for a task to know what is downstream and upstream of the
> pipeline. I'd like proper synchronization like locks and latches and waits
> between tasks - yes, synchronization is complex and can lead to deadlocks,
> but we also can have deadlocks _now_ without the benefits of
> synchronization. I'd like a way for one task to be able to say what it is
> going to do to the stream with respect to the sort ordering or the bounding
> box. And I'd like for other tasks to be able to reason about that in the
> context of the current pipeline.
> Enough I'd likes. :) If you have a good "control plane", good things
> happen, that's it, actually ;) I very well understand that this is going to
> make the task API more complex. But I think it would pay off as you could
> use Osmosis for more tasks. Passing metadata in the data stream and
> exceptions and --fast-used-node make things more complex as well, and the
> payoff is less IMHO. You're right when you say that implementing a
> solution that isn't good enough is worse than not implementing one at all -
> so the question for me is, really: what is "good enough"?
> I have some - very basic - thoughts about how that "control plane" could
> work. I think a Wiki page with a more thorough description of that approach
> would be a better way to communicate it. I'll try to do that. Or would you
> prefer to have it here on the list?
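
[Illustrative sketch, not part of either message.] The <bound> case in the
quoted text can be made concrete with a small Java example. It assumes a
hypothetical ControlPlane object shared by the tasks of one pipeline, and a
simplified NodeSink interface that stands in for the data channel. None of
these names are the real Osmosis API. A pass-through task computes the
bounding box while streaming and publishes the four doubles out of band, so
it never has to cache the stream itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

class ControlPlaneSketch {

    /** The "4 doubles" of a <bound>; a hypothetical value type. */
    static final class Bounds {
        final double left, right, top, bottom;
        Bounds(double left, double right, double top, double bottom) {
            this.left = left; this.right = right; this.top = top; this.bottom = bottom;
        }
    }

    /** Minimal stand-in for an OSM node. */
    static final class Node {
        final long id; final double lon, lat;
        Node(long id, double lon, double lat) { this.id = id; this.lon = lon; this.lat = lat; }
    }

    /** Hypothetical out-of-band channel shared by all tasks of one pipeline. */
    static final class ControlPlane {
        final CompletableFuture<Bounds> bounds = new CompletableFuture<>();
    }

    /** Simplified data channel (an assumption, not the real Osmosis Sink). */
    interface NodeSink {
        void process(Node node);
        void complete();
    }

    /** Streams nodes through unchanged and publishes the extent at the end. */
    static final class BoundsTracker implements NodeSink {
        private final ControlPlane plane;
        private final NodeSink downstream;
        private double left = Double.POSITIVE_INFINITY, right = Double.NEGATIVE_INFINITY;
        private double bottom = Double.POSITIVE_INFINITY, top = Double.NEGATIVE_INFINITY;

        BoundsTracker(ControlPlane plane, NodeSink downstream) {
            this.plane = plane;
            this.downstream = downstream;
        }

        public void process(Node n) {
            left = Math.min(left, n.lon);
            right = Math.max(right, n.lon);
            bottom = Math.min(bottom, n.lat);
            top = Math.max(top, n.lat);
            downstream.process(n); // data keeps flowing, nothing is buffered here
        }

        public void complete() {
            // Control information travels on its own channel, not in the stream.
            plane.bounds.complete(new Bounds(left, right, top, bottom));
            downstream.complete();
        }
    }

    /** Records node ids as "written" and receives the bounds out of band. */
    static final class Writer implements NodeSink {
        final List<Long> written = new ArrayList<>();
        Bounds header; // filled in whenever the control plane delivers it

        Writer(ControlPlane plane) {
            plane.bounds.thenAccept(b -> header = b);
        }

        public void process(Node n) { written.add(n.id); }
        public void complete() { }
    }

    static Writer demo() {
        ControlPlane plane = new ControlPlane();
        Writer writer = new Writer(plane);
        NodeSink pipeline = new BoundsTracker(plane, writer);
        pipeline.process(new Node(1, 9.18, 48.78));
        pipeline.process(new Node(2, 9.25, 48.70));
        pipeline.complete();
        return writer;
    }

    public static void main(String[] args) {
        Writer w = demo();
        System.out.println(w.written.size() + " nodes, bounds "
                + w.header.left + "," + w.header.bottom + ","
                + w.header.right + "," + w.header.top);
    }
}
```

Note the limit of the sketch: the bounds still only become known at the end
of the stream, so a real writer would have to defer or patch its header.
What the separate channel removes is the need for the intermediate task to
cache the whole stream just to emit four doubles ahead of it.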

Thanks for the detailed email.  I don't have any strong thoughts regarding
your ideas.  I do worry that it would add too much complexity and therefore
result in a less flexible tool than exists right now due to fewer tasks being
available.  It might make the barrier to entry even higher than it is now
and scare off the few people who understand the existing codebase.  On
the other hand, the current codebase has been fairly stagnant for a long
time now so some fresh ideas fixing the existing limitations might be just
what is needed to kick things along.

I don't have a good feel for how a control plane would work in practice.
The change that you're proposing is more than I have time to be involved in
so I wouldn't want to waste your time by debating on the wiki or mailing
list.  As with all things OSM, the best way is typically to create something
and see if others are interested.  I'd ask you to keep it separate from the
existing codebase (i.e. create a branch, or Git tree, or whatever strategy
works for you) until it's proven and somebody is ready to maintain the new
codebase longer term.  Osmosis itself started out as a very simple set of
tasks I used personally and it gradually grew into the larger tool available
today.  I'd suggest it's best to try a few ideas out yourself and see how
they work in practice rather than try to hash out a design in a public

Apologies if any of the above comes across negative.  I strongly encourage
you to see what you can come up with.  I'm just not able to get involved

> As to your hesitation to accept the --fast-used-* workaround: you're
> absolutely right, I understand that now. There are issues with those tasks,
> and I could and maybe will address some of those issues - but it will still
> stay a workaround. Now even I don't think it should go in the main
> distribution.
> I've packaged --fast-used-* as a plugin and I'm going to make it public
> somewhere for those who need it now, like myself. What would be a good place
> to do that, BTW?

There's no obvious place to host plugins.  The plugins that exist now are
hosted in various locations.  Wherever you choose to host it, you can add
documentation to the existing wiki pages along with links to downloading
it.  If it sees usage by a reasonable number of users then it could be added
to the existing distribution.  It's a self-contained task so it doesn't
cause problems for the rest of the codebase.

> Thank you for taking your time to read all of this,

No worries.  Sorry for the delay in responding :-)

osmosis-dev mailing list
