Another thought.

It looks like a "Sub-task operator". Kind of a special "Operator" to handle
such case (for example Docker-Compose or Kubernetes-POD driven).
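Something along these lines (a rough sketch only - SubTaskOperator and its
compose_file parameter are made-up names, nothing that exists in Airflow
today; it just shows how the idea could be built on top of BaseOperator):

import subprocess

from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class SubTaskOperator(BaseOperator):
    """Runs a group of co-located processes (here a docker-compose project)
    as ONE Airflow task, so the sub-processes can share memory between
    themselves regardless of which executor schedules the task."""

    @apply_defaults
    def __init__(self, compose_file, *args, **kwargs):
        super(SubTaskOperator, self).__init__(*args, **kwargs)
        self.compose_file = compose_file

    def execute(self, context):
        # All services defined in the compose file start together on the
        # same worker, and the Airflow task succeeds only when the whole
        # group exits cleanly.
        subprocess.check_call(
            ["docker-compose", "-f", self.compose_file, "up",
             "--abort-on-container-exit"])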

Currently we can trigger other tasks at the same time (in parallel) or
sequentially (with a dependency on the task finishing/failing). But we could
also have dependencies that are triggered DURING the task's execution - when
some of the data has already been produced but other processes are still
running inside the task and will trigger another dependency once they finish.

We could have the operator trigger other tasks during the run (when one of
the PODs finishes its work, for example). This would be far less
paradigm-changing and should work with all the executors, I think. It's
still a bit of a niche case though, and maybe an unnecessary optimization.
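A rough sketch of that variant too (again purely hypothetical - the process
commands and the "publish_partial_results" DAG id are made up; it relies on
the experimental trigger_dag helper from Airflow 1.10 to fire a separate DAG
programmatically as soon as part of the data is ready):

import subprocess

from airflow.api.common.experimental.trigger_dag import trigger_dag
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults


class EarlyTriggerOperator(BaseOperator):
    """Triggers downstream work DURING its own run, as soon as part of
    the data is produced, while the rest keeps processing."""

    @apply_defaults
    def __init__(self, *args, **kwargs):
        super(EarlyTriggerOperator, self).__init__(*args, **kwargs)

    def execute(self, context):
        fast = subprocess.Popen(["python", "produce_first_chunk.py"])
        slow = subprocess.Popen(["python", "produce_remaining_chunks.py"])

        # As soon as the first chunk is ready, kick off the dependent DAG
        # instead of waiting for the whole task to finish.
        if fast.wait() == 0:
            trigger_dag(dag_id="publish_partial_results")

        if slow.wait() != 0:
            raise RuntimeError("remaining-chunks process failed")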

J.

On Tue, Nov 26, 2019 at 12:25 PM Ash Berlin-Taylor <a...@apache.org> wrote:

> My gut reaction to this is no, not as a general purpose thing. It would
> only work for LocalExecutor reliably - it's never going to work with Kube
> executor, and almost never work in Celery.
>
> Also the cases where the data is small enough to fit in memory but not so
> large you need to put it in S3/hdfs etc don't seem worth special casing for?
>
> > On 26 Nov 2019, at 11:21, Jarek Potiuk <jarek.pot...@polidea.com> wrote:
> >
> > *TL;DR; Discuss whether shared memory data sharing for some tasks is an
> > interesting feature for future Airflow.*
> >
> > I had a few discussions recently with several Airflow users (including on
> > Slack [1] and in person at the Warsaw Airflow meetup) about using shared
> > memory for inter-task communication.
> >
> > Airflow is not currently good for such a case. It sounds doable, but
> > fairly complex to implement (and it modifies the Airflow paradigm a bit).
> > I am not 100% sure it's a good idea to have such a feature in the future.
> >
> > I see the need for it and I like it; however, I would love to ask for
> > your opinions.
> >
> > *Context*
> >
> > The case is to have several independent tasks using a lot of temporary
> > data in memory. They either run in parallel and share loaded data, or
> > use shared memory to pass results between tasks. Example: machine
> > learning (like audio processing) - it makes sense to load the audio
> > files into memory only once and run several tasks on that loaded data.
> >
> > The best way to achieve it now is to combine such memory-sharing tasks
> > into a single operator (Docker-Compose for example?) and run them as a
> > single Airflow task. But maybe those tasks could still be modelled as
> > separate tasks in the Airflow DAG. One benefit is that there might be
> > different dependencies for different tasks; processing results from some
> > tasks could be sent independently using different - existing - operators.
> >
> > As a workaround, we can play with queues and have one dedicated machine
> > to run all such tasks, but that has multiple limitations.
> >
> > *High-level idea*
> >
> > At a high level it would require defining some affinity between tasks to
> > make sure that:
> >
> > 1) they are all executed on the same worker machine
> > 2) the processes remain in memory until all tasks finish, so data can be
> > shared (even if there is a dependency between the tasks)
> > 3) back-filling acts on the whole group of such tasks as a "single unit".
> >
> > I would love to hear your feedback.
> >
> > J
> >
> >
> > [1] Slack discussion on shared memory:
> > https://apache-airflow.slack.com/archives/CCR6P6JRL/p1574745209437200
> >
> > J.
> > --
> >
> > Jarek Potiuk
> > Polidea <https://www.polidea.com/> | Principal Software Engineer
> >
> > M: +48 660 796 129
>
>

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129
