There was an old design document I proposed on this ML a while back.
I never got around to implementing it, and I think it has aged somewhat,
but it covers some of the points I brought up and might be worth
reviewing.

https://docs.google.com/document/d/1MfVE9td9D4n5y-PTn66kk4-9xG7feXs1zSFf-qxQgPs/edit#heading=h.e54mys6bvhhe

On Tue, Apr 26, 2022 at 10:05 AM Sasha Krassovsky
<krassovskysa...@gmail.com> wrote:
>
> An ExecPlan is composed of a bunch of implicit “pipelines”. Each node in a 
> pipeline (starting with a source node) implements `InputReceived` and 
> `InputFinished`. On `InputReceived`, it performs its computation and calls 
> `InputReceived` on its output. On `InputFinished`, it performs any cleanup 
> and calls `InputFinished` on its output (note that in the code, `outputs_` is 
> a vector, but we only ever use `outputs_[0]`; this will probably get cleaned 
> up at some point). As such, there’s an implicit pipeline of chained calls to 
> `InputReceived`. Some nodes, such as Join, GroupBy, or Sort, are pipeline 
> breakers: they must accumulate the whole dataset before performing their 
> computation and starting off the next pipeline. Pipeline breakers would make 
> use of facilities like TaskGroup.
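>
> To make the chaining concrete, here is a rough sketch (not the actual Arrow 
> classes, just an illustration of the shape described above) of a node that 
> reacts to `InputReceived` by computing on the batch and forwarding it, and 
> to `InputFinished` by cleaning up and propagating the finish signal:
>
>     #include <utility>
>
>     struct Batch { /* columns ... */ };
>
>     class Node {
>      public:
>       virtual ~Node() = default;
>       virtual void InputReceived(Batch batch) = 0;
>       virtual void InputFinished() = 0;
>       void SetOutput(Node* out) { output_ = out; }
>      protected:
>       Node* output_ = nullptr;  // stand-in for outputs_[0]
>     };
>
>     class FilterNode : public Node {
>      public:
>       void InputReceived(Batch batch) override {
>         // Perform this node's computation on the batch, then push the
>         // result to the next node in the pipeline.
>         if (output_ != nullptr) output_->InputReceived(std::move(batch));
>       }
>       void InputFinished() override {
>         // Release any per-node state, then propagate the finish signal.
>         if (output_ != nullptr) output_->InputFinished();
>       }
>     };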
>
> So the model of parallelism is driven by the source nodes: if your source 
> node is multithreaded, then you may have several concurrent calls to 
> `InputReceived`. Weston mentioned to me today that there may be a way to give 
> some sort of guarantee of “almost-ordered” input, which may be enough to make 
> streaming work well (you’d only have to accumulate at most `num_threads` 
> extra batches in memory at a time). I’m not sure of the details, but that 
> may be possible.
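>
> As a hedged sketch of what that could look like (this is not an existing 
> Arrow API, just one way such buffering of “almost-ordered” input might be 
> done): batches carry a sequence index, out-of-order batches are held in a 
> small buffer, and consecutive runs are released downstream as soon as the 
> next expected index arrives, so the buffer stays roughly `num_threads` deep.
>
>     #include <functional>
>     #include <map>
>     #include <utility>
>
>     struct SeqBatch { int seq; /* columns ... */ };
>
>     class SequencingBuffer {
>      public:
>       explicit SequencingBuffer(std::function<void(const SeqBatch&)> emit)
>           : emit_(std::move(emit)) {}
>
>       // Called as batches arrive (synchronization omitted for brevity).
>       void Push(const SeqBatch& batch) {
>         pending_.emplace(batch.seq, batch);
>         // Flush the consecutive run starting at the next expected index.
>         while (!pending_.empty() && pending_.begin()->first == next_) {
>           emit_(pending_.begin()->second);
>           pending_.erase(pending_.begin());
>           ++next_;
>         }
>       }
>
>      private:
>       int next_ = 0;
>       std::map<int, SeqBatch> pending_;
>       std::function<void(const SeqBatch&)> emit_;
>     };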
>
> Hopefully the description of how parallelism works was at least helpful!
>
> Sasha
>
> > On Apr 26, 2022, at 12:54 PM, Li Jin <ice.xell...@gmail.com> wrote:
> >
> > sure how they would output. (i.e., do they output batches / call
>
