We are using AWS ECS to deploy Airflow, and we rely on it for some degree of 
high availability and for scaling workers.

We have defined 3 ECS services: scheduler / webserver / worker.

The scheduler and webserver each run in a single container.

The worker service can scale to as many containers as we want; we currently 
have 3 workers running within the worker service.

We use the ECS service scheduler to make sure there is always one Airflow 
scheduler running; in fact, we start the Airflow scheduler with the 
run-duration param set to 10 minutes, so it gets restarted continuously by 
ECS.
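
For illustration, here is a rough sketch (with placeholder names, not our 
exact setup) of how the scheduler service can be wired up with boto3 so that 
ECS keeps exactly one scheduler task alive and replaces it whenever the 
run-duration expires:

    # Hypothetical sketch: register a scheduler task definition and a service
    # with desiredCount=1 so ECS always keeps exactly one scheduler running.
    import boto3

    ecs = boto3.client("ecs")

    ecs.register_task_definition(
        family="airflow-scheduler",
        containerDefinitions=[{
            "name": "scheduler",
            "image": "my-registry/airflow:latest",  # placeholder image
            "memory": 2048,
            "essential": True,
            # Airflow 1.x flag: the scheduler exits after 600s, and the ECS
            # service scheduler immediately launches a replacement task.
            "command": ["airflow", "scheduler", "--run-duration", "600"],
        }],
    )

    ecs.create_service(
        cluster="airflow",  # placeholder cluster name
        serviceName="scheduler",
        taskDefinition="airflow-scheduler",
        desiredCount=1,  # always exactly one scheduler task
    )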

We have also defined health check endpoints to check the health of all the 
Airflow processes. For instance, to check the health of the scheduler, we use 
a system health DAG that spins up 3 dummy tasks which write some logs to S3 
and fire an event to New Relic. The scheduler healthcheck endpoint just 
checks that there is a task instance log for the last DAG run, and we use the 
New Relic sys_health events to define alerts. The healthcheck endpoints are 
used by ECS to check that each of the Airflow ECS services is healthy.
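
As a rough illustration, the scheduler healthcheck endpoint could look 
something like the sketch below (the bucket, prefix, and freshness threshold 
are made-up placeholders, not our real values): it simply checks S3 for a 
recent task instance log from the health DAG and returns 200 or 503 
accordingly.

    # Hypothetical sketch of a scheduler healthcheck endpoint: healthy if the
    # sys_health DAG wrote at least one task log to S3 recently.
    import datetime

    import boto3
    from flask import Flask, jsonify

    app = Flask(__name__)
    s3 = boto3.client("s3")

    LOG_BUCKET = "my-airflow-logs"            # placeholder bucket
    LOG_PREFIX = "logs/sys_health/"           # placeholder prefix
    MAX_AGE = datetime.timedelta(minutes=15)  # health DAG runs ~every 10 min

    @app.route("/health/scheduler")
    def scheduler_health():
        """Return 200 if a fresh sys_health task log exists, else 503."""
        resp = s3.list_objects_v2(Bucket=LOG_BUCKET, Prefix=LOG_PREFIX)
        now = datetime.datetime.now(datetime.timezone.utc)
        fresh = [obj for obj in resp.get("Contents", [])
                 if now - obj["LastModified"] < MAX_AGE]
        if fresh:
            return jsonify(status="ok", fresh_logs=len(fresh)), 200
        return jsonify(status="unhealthy"), 503

ECS (e.g. via a load balancer health check) then treats a string of 503s as 
an unhealthy task and replaces the container.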

We also deploy our DAGs inside the Docker image when it is built, so we have 
an immutable image. It's not ideal to rebuild the image and redeploy the 
whole Airflow cluster for a small DAG change, but it's simpler than having to 
deal with mounted volumes. We put our logs on S3, so we don't mind killing 
containers so often.
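
The deploy step is then just "rebuild, push, force a new deployment". A 
hedged sketch (image, cluster, and service names are placeholders):

    # Hypothetical sketch of the immutable-image deploy flow: DAGs are COPY'd
    # into the image at build time, so shipping a DAG change means rebuilding
    # the image and bouncing every ECS service.
    import subprocess

    import boto3

    IMAGE = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/airflow:latest"  # placeholder
    CLUSTER = "airflow"  # placeholder cluster
    SERVICES = ["scheduler", "webserver", "worker"]

    subprocess.run(["docker", "build", "-t", IMAGE, "."], check=True)
    subprocess.run(["docker", "push", IMAGE], check=True)

    ecs = boto3.client("ecs")
    for service in SERVICES:
        # forceNewDeployment restarts tasks so they pull the fresh :latest tag
        ecs.update_service(cluster=CLUSTER, service=service,
                           forceNewDeployment=True)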

It's working fine so far, but we have just started (we plan to migrate a few 
hundred DAGs from another workflow tool) and only have a few DAGs on Airflow, 
so I don't know if we will keep this approach once we have a few dozen DAG 
changes every day.

Regards,
Yacine

On 10/06/2018, 21:04, "Ali Uz" <[email protected]> wrote:

    We also run one beefy box in AWS ECS, with the scheduler and webserver
    running in the same container. However, we have run into issues with this
    approach, as the scheduler does fail at times and our DAGs get stuck until
    I manually restart the container.
    What approaches do you guys use to restart the scheduler automatically when
    it's stuck and/or failed?

    - Ali

    On Sun, Jun 10, 2018 at 8:44 PM Bolke de Bruin <[email protected]> wrote:

    > If you are running on one big box, you most certainly want to put the
    > scheduler in its own cgroup and run the tasks with sudo in their own
    > cgroups. Otherwise your availability might suffer.
    >
    > B.
    >
    > Sent from my iPad
    >
    > > On 10 Jun 2018 at 16:30, Sam Sen <[email protected]> wrote:
    > >
    > > Wouldn't you want immutable containers? Hence, baking the code into
    > > the container would be more ideal.
    > >
    > >> On Sun, Jun 10, 2018, 9:53 AM Arash Soheili <[email protected]>
    > wrote:
    > >>
    > >> We are just starting out, but our setup is 2 EC2 instances, one
    > >> running the web server and scheduler and the other running multiple
    > >> workers. The database is an RDS instance which both are connected to,
    > >> as well as Redis on AWS ElastiCache for the Celery connection.
    > >>
    > >> All 4 services run in containers with systemd, and we use CodeDeploy
    > >> to sync up the code by mapping volumes from the local filesystem into
    > >> the container. We are not yet heavy users of Airflow, so I can't
    > >> speak to performance and scale just yet.
    > >>
    > >> In general I think an AMI with baked-in code can be brittle and hard
    > >> to maintain and update. Containers are the way to go, as you can bake
    > >> the code into the image if you want. We have chosen not to do that
    > >> and rely on volume mapping to update the latest code in the
    > >> container. This makes things easier, as you don't need to keep
    > >> creating new images.
    > >>
    > >> Arash
    > >>
    > >>> On Sat, Jun 9, 2018 at 9:47 AM Naik Kaxil <[email protected]> wrote:
    > >>>
    > >>> Let us know your findings after trying the beefy box approach.
    > >>>
    > >>> On 08/06/2018, 12:24, "Sam Sen" <[email protected]> wrote:
    > >>>
    > >>>    We are facing this now. We have tried the CeleryExecutor and it
    > >>>    adds more moving parts. While we have not thrown out this idea,
    > >>>    we are going to give one big beefy box a try.
    > >>>
    > >>>    To handle the HA side of things, we are putting the server in an
    > >>>    auto-scaling group (we use AWS) with a min and max of 1 server. We
    > >>>    deploy from an AMI that has Airflow baked in, and we point the DB
    > >>>    config to an RDS instance using service discovery (Consul).
    > >>>
    > >>>    As for the DAG code, we can either bake it into the AMI as well
    > >>>    or install it on bootup. We haven't decided what to do for this,
    > >>>    but either way, we realize it could take a few minutes to fully
    > >>>    recover in the event of a catastrophe.
    > >>>
    > >>>    The other option is to have a standby server if using Celery
    > >>>    isn't ideal. With that, I have tried using HashiCorp Nomad to
    > >>>    handle the services. In my limited trial, it did what we wanted,
    > >>>    but we need more time to test.
    > >>>
    > >>>>    On Fri, Jun 8, 2018, 4:23 AM Naik Kaxil <[email protected]> wrote:
    > >>>>
    > >>>> Hi guys,
    > >>>>
    > >>>>
    > >>>>
    > >>>> I have 2 specific questions for the guys using Airflow in
    > >>>> production:
    > >>>>
    > >>>>
    > >>>>
    > >>>>   1. How have you achieved high availability? What does the
    > >>>>      architecture look like? Do you replicate the master node as
    > >>>>      well?
    > >>>>   2. Scale up vs. scale out?
    > >>>>      1. What is the preferred approach you take? One beefy Airflow
    > >>>>         VM with worker, scheduler and webserver using the
    > >>>>         LocalExecutor, or a cluster with multiple workers using the
    > >>>>         CeleryExecutor?
    > >>>>
    > >>>>
    > >>>>
    > >>>> I think this thread should help others as well with similar
    > >>>> questions.
    > >>>>
    > >>>>
    > >>>>
    > >>>>
    > >>>>
    > >>>> Regards,
    > >>>>
    > >>>> Kaxil
    > >>>>
    > >>>>
    > >>>>
    > >>>>
    > >>>> Kaxil Naik
    > >>>>
    > >>>> Data Reply
    > >>>> 2nd Floor, Nova South
    > >>>> 160 Victoria Street, Westminster
    > >>>> London SW1E 5LB - UK
    > >>>> phone: +44 (0)20 7730 6000
    > >>>> [email protected]
    > >>>> www.reply.com
    > >>>>
    > >>>>
    > >>>
    > >>
    >

