There is an effort we are working on in the "autoscaling" group -
a work-in-progress KNative executor, which is the "serverless" approach.
And while we will likely release it in Airflow 2.0, we found a number of
limitations of the current serverless approach (at least the KNative
variant of it) that make the KNative executor's use very limited. My
interpretation - at a very high level - is that the serverless approach
is not very good when there is a lot of state to be shared between tasks.
In the case of Airflow (when you have a large number of DAGs and tasks),
there is a significant amount of code plus dependencies to be shared
between the various tasks - and by their nature these can change (albeit
slowly). The codebase is, however, large and dynamic enough to be a
concern for serverless approaches when it comes to optimisations and
dependency reuse. There are also limitations on task running time - it is
not uncommon in Airflow for a task to run for many hours. Both limitations
make Airflow not very well suited to a serverless approach, IMHO.
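
The running-time limitation can be illustrated with a minimal,
self-contained Python sketch. The invoker below and the tiny durations
are hypothetical stand-ins - real serverless platforms enforce execution
limits measured in minutes, while Airflow tasks may run for hours:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FunctionTimeout


def long_running_task():
    # Stands in for an Airflow task that runs for hours.
    time.sleep(0.2)
    return "done"


def invoke_as_serverless(fn, timeout_s):
    """Invoke fn, but abandon the result once the platform's
    (hypothetical) execution limit is exceeded."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except FunctionTimeout:
            return "killed by platform timeout"


# A "platform limit" shorter than the task duration kills the task;
# a generous limit lets it complete.
print(invoke_as_serverless(long_running_task, 0.05))
print(invoke_as_serverless(long_running_task, 1.0))
```

The point of the sketch: any task whose duration exceeds the platform's
hard limit simply cannot be hosted this way, regardless of how the
scheduler side is implemented.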


J.



On Sat, Jan 25, 2020 at 8:00 PM Kamil Breguła <[email protected]>
wrote:

> The problem is Airflow's flexibility. All tasks are defined as Python
> code, and this code must be executed to see whether the tasks can be
> executed. Add to that the fact that one machine cannot execute code
> from different users, because Python does not provide the required
> isolation. So we have a case where you still need to run the scheduler
> to be able to check whether a task should be carried out. It would
> only be possible to avoid this if each task were static and its
> condition could not change during execution. I would also point out
> that serverless services use one central database, e.g. Google
> Spanner, whereas in Airflow every operator can perform arbitrary
> database queries.
>
> Some providers already offer workflow orchestration as a serverless
> service, but its practical use is very limited.
> https://aws.amazon.com/step-functions/
>
> Note that work on autoscaling for workers is being undertaken to
> ensure more efficient use of resources at times of limited activity.
> Next, attempts will be made to decouple the database, e.g. by creating
> an API and communicating only through it.
>
> On Sat, Jan 25, 2020 at 7:01 PM Malthe <[email protected]> wrote:
> >
> > In a typical deployment, the scheduler and worker processes run as
> > daemons, lingering during idle periods (the scheduler literally sleeps
> > as part of its run loop).
> >
> > Meanwhile, deploying to a serverless platform is attractive for
> > various reasons (cost, security, simplicity, scaling).
> >
> > For example, on Azure where there's a free grant of 1,000,000
> > executions per month, running a single iteration of the Airflow
> > scheduler loop on a timer trigger set to once every 3 seconds would
> > cost nothing. Still using Azure as an example, tasks might be
> > submitted using an Event Grid subscription connected to an Azure
> > Function App, which would immediately pick up one or more tasks.
> >
> > It seems like this wouldn't be too difficult to implement for a given
> > cloud provider, but I'd like to ask on this list whether there are
> > obvious issues with this execution model and whether this has been
> > suggested or even implemented previously. I wasn't able to find any
> > mention.
> >
> > cheers
>
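
As a rough illustration of the timer-trigger idea discussed above, here
is a hypothetical sketch of running one scheduling pass per tick,
together with the cost arithmetic from the Azure example. The function
and state names are illustrative only, not Airflow's real API:

```python
import math

# One tick every 3 seconds over a 30-day month stays inside the
# 1,000,000-execution free grant mentioned in the thread.
TICKS_PER_MONTH = math.ceil(30 * 24 * 3600 / 3)  # 864,000
FREE_GRANT = 1_000_000


def scheduler_tick(task_states):
    """Run a single scheduling pass instead of a daemon loop:
    move every 'scheduled' task to 'queued' and report how many."""
    queued = 0
    for task_id, state in task_states.items():
        if state == "scheduled":
            task_states[task_id] = "queued"
            queued += 1
    return queued


# Each timer invocation would call scheduler_tick once and exit,
# so the process lingers for milliseconds rather than sleeping.
tasks = {"a": "scheduled", "b": "running", "c": "scheduled"}
print(scheduler_tick(tasks))
print(TICKS_PER_MONTH < FREE_GRANT)
```

Note that this sidesteps the hard part raised earlier in the thread: a
real pass must execute user DAG code to discover what is runnable, which
is exactly where the shared-state and isolation concerns bite.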


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129
