Hey Daniel,
Great work. We're looking at running airflow on AWS ECS inside Docker
containers and are making good progress with it.
We use Redis and RDS as managed services to form the comms backbone, and then
spawn webserver, scheduler, worker and flower containers on ECS as needed.
We deploy DAGs via an Elastic File System share (mounted on all instances),
which is then mapped read-only into the Docker containers.
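For reference, a minimal sketch of how that read-only mount could look in an
ECS task definition (image name, memory and paths are placeholders, and we
assume the EFS share is mounted on the container instances under /mnt/efs):

    import boto3

    ecs = boto3.client("ecs")
    # Hypothetical names/paths: the EFS share is mounted on the host and the
    # DAG folder is exposed read-only inside the worker container.
    ecs.register_task_definition(
        family="airflow-worker",
        volumes=[{"name": "dags", "host": {"sourcePath": "/mnt/efs/airflow/dags"}}],
        containerDefinitions=[{
            "name": "worker",
            "image": "example/airflow:latest",
            "memory": 4096,
            "command": ["airflow", "worker"],
            "mountPoints": [{
                "sourceVolume": "dags",
                "containerPath": "/usr/local/airflow/dags",
                "readOnly": True,
            }],
        }],
    )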
We're now evaluating this setup in more earnest going forward.
Good idea to use queues to separate dependencies or other concerns
(high-mem pods?); that gives you plenty of ways to customize where, and on
which hardware, a DAG is going to run. We're looking at cycle scaling to
temporarily increase resources for the morning run, and at creating larger
worker containers for data science tasks and perhaps a few others.
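To make that concrete, a rough sketch of what we have in mind (task and
callable names are made up; 'queue' is the standard operator argument, and a
bigger worker container would consume it via 'airflow worker -q high_mem'):

    from airflow.operators.python_operator import PythonOperator

    # Route a heavy data-science task to a dedicated queue consumed by a
    # larger worker container started with 'airflow worker -q high_mem'.
    train = PythonOperator(
        task_id="train_model",           # hypothetical task
        python_callable=train_model_fn,  # hypothetical callable, defined elsewhere
        queue="high_mem",
        dag=dag,                         # assumes a 'dag' object defined elsewhere
    )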
- In terms of tooling: the current airflow config is somewhat static, in the
sense that it does not reconfigure itself to the (now) dynamic environment.
You'd think airflow should query the environment to figure out parallelism
instead of having it specified statically.
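A rough sketch of what I mean, using airflow's environment-variable overrides
with values derived from the host it lands on (the multipliers are arbitrary,
and the variables would have to be set before the scheduler/worker starts):

    import multiprocessing
    import os

    # Derive concurrency from the machine we actually run on instead of
    # hard-coding it in airflow.cfg; multipliers are arbitrary examples.
    cores = multiprocessing.cpu_count()
    os.environ["AIRFLOW__CORE__PARALLELISM"] = str(cores * 4)
    os.environ["AIRFLOW__CELERY__CELERYD_CONCURRENCY"] = str(cores * 2)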
- Sometimes DAGs import hooks or operators that import dependencies at the
top. The only reason (I think) that a scheduler needs to physically import
and parse a DAG is that there may be dynamically built elements within it.
If a DAG contained only statically defined elements, it would theoretically
be possible to optimize this away. Your PDF sort of hints towards a system
where the worker a DAG will eventually run on parses the DAG and reports
back a meta description of it, which could simplify the scheduler and improve
its performance at the cost of network round trips.
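Purely as a sketch of what such a meta description could look like (the
folder path and the dict layout are made up):

    from airflow.models import DagBag

    def describe_dags(dag_folder="/mnt/efs/airflow/dags"):
        # Parse DAGs locally on the worker and build a lightweight summary
        # (schedule, task ids, upstream dependencies) that could be reported
        # back to the scheduler instead of having it re-parse every file.
        meta = {}
        for dag_id, dag in DagBag(dag_folder).dags.items():
            meta[dag_id] = {
                "schedule": str(dag.schedule_interval),
                "tasks": {t.task_id: sorted(t.upstream_task_ids)
                          for t in dag.tasks},
            }
        return meta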
- About redeploying instances: we see this as a potential issue for our
setup. My take is that, in principle, jobs simply shouldn't take that much
time to begin with, which avoids having to worry about this. If that's
unrealistic, shouldn't it be a concern of the environment airflow runs in
rather than of airflow itself? I.e. build further tooling around the
kubernetes CLIs/operators to query the environment and plan/deny/schedule
this kind of work automatically. Because k8s was probably built from the
perspective of handling short-running workloads, running anything long-term
on it is naturally going to compete with the architecture.
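Even something as simple as the sketch below (using the kubernetes Python
client) could feed a 'can we run this long job now?' decision that lives
outside airflow itself:

    from kubernetes import client, config

    # Query per-node allocatable resources before admitting a long-running
    # job; the actual plan/deny logic would sit outside airflow.
    config.load_incluster_config()  # or config.load_kube_config() off-cluster
    for node in client.CoreV1Api().list_node().items:
        alloc = node.status.allocatable
        print(node.metadata.name, alloc.get("cpu"), alloc.get("memory"))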
- About failures and instances disappearing on failure: it's not desirable
to keep those instances around for long, so we really do need to depend on
client logging and other services to tell us what happened. The shift in
thinking is that a pod/container is just a temporary thing that runs a job,
and we should be interested in how the job did rather than in how the
container/pod ran it. From my limited experience with k8s, though, I do see
that it tends to get rid of everything a little too quickly on failure. One
thing you could look into is logging to a commonly shared volume with a
specific 'key' for that container, so you can always refer back to the
important log file and fish it out, with measures to clean up the shared
filesystem on a regular basis.
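A minimal sketch of that shared-volume idea (paths are hypothetical; the
container hostname doubles as the 'key'):

    import logging
    import os
    import socket

    # Write logs to a shared volume keyed by the container/pod hostname so
    # the file outlives the pod; a periodic job cleans up old keys.
    key = socket.gethostname()
    log_dir = os.path.join("/mnt/shared-logs", key)
    if not os.path.isdir(log_dir):
        os.makedirs(log_dir)
    handler = logging.FileHandler(os.path.join(log_dir, "task.log"))
    logging.getLogger().addHandler(handler)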
- About rescaling and starting jobs: it doesn't come for free, as you
mention. I think it's a great idea to be able to scale out during busy
intervals (we intend to just use cycle scaling here), but a hint towards
which policy or scaling strategy you intend to use on k8s would be welcome
there.
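For what it's worth, on the ECS side our cycle scaling would boil down to a
scheduled action on the worker service, roughly like this (all names and
numbers are illustrative):

    import boto3

    aas = boto3.client("application-autoscaling")
    # Scale the worker service out ahead of the morning run; a second
    # scheduled action would scale it back down afterwards.
    aas.put_scheduled_action(
        ServiceNamespace="ecs",
        ScheduledActionName="morning-scale-out",
        ResourceId="service/airflow-cluster/airflow-worker",
        ScalableDimension="ecs:service:DesiredCount",
        Schedule="cron(0 6 * * ? *)",
        ScalableTargetAction={"MinCapacity": 4, "MaxCapacity": 12},
    )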
Gerard
On Wed, Jul 5, 2017 at 8:43 PM, Daniel Imberman <[email protected]>
wrote:
> @amit
>
> I've added the proposal to the PR for now. Should make it easier for people
> to get to it. Will delete once I add it to the wiki.
> https://github.com/bloomberg/airflow/blob/29694ae9903c4dad3f18fb8eb767c4922dbef2e8/dimberman-KubernetesExecutorProposal-050717-1423-36.pdf
>
> Daniel
>
> On Wed, Jul 5, 2017 at 11:36 AM Daniel Imberman <[email protected]> wrote:
>
> > Hi Amit,
> >
> > For now the design doc is included as an attachment to the original
> email.
> > Once I am able to get permission to edit the wiki I would like add it
> there
> > but for now I figured that this would get the ball rolling.
> >
> >
> > Daniel
> >
> >
> > On Wed, Jul 5, 2017 at 11:33 AM Amit Kulkarni <[email protected]> wrote:
> >
> >> Hi Daniel,
> >>
> >> I don't see link to design PDF.
> >>
> >>
> >> Amit Kulkarni
> >> Site Reliability Engineer
> >> Mobile: (716)-352-3270
> >>
> >> Payments partner to the platform economy
> >>
> >> On Wed, Jul 5, 2017 at 11:25 AM, Daniel Imberman <[email protected]> wrote:
> >>
> >> > Hello Airflow community!
> >> >
> >> > My name is Daniel Imberman, and I have been working on behalf of
> >> Bloomberg
> >> > LP to create an airflow kubernetes executor/operator. We wanted to
> allow
> >> > for maximum throughput/scalability, while keeping a lot of the
> >> kubernetes
> >> > details abstracted away from the users. Below I have a link to the WIP
> >> PR
> >> > and the PDF of the initial proposal. If anyone has any
> >> comments/questions I
> >> > would be glad to discuss this feature further.
> >> >
> >> > Thank you,
> >> >
> >> > Daniel
> >> >
> >> > https://github.com/apache/incubator-airflow/pull/2414
> >> >
> >>
> >
>