Also worth mentioning that when you restart the scheduler, it will use etcd and Postgres to recreate state, so you won't end up re-launching or missing tasks.
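Deeper in the thread Greg mentions that the scheduler subscribes to the Kubernetes watch API and keeps a checkpoint so it can resubscribe after a restart. For anyone curious what that pattern looks like, here is a minimal sketch (not the actual scheduler code) using the official kubernetes Python client; the "airflow" namespace and the label selector are placeholders for illustration.

    # Rough sketch of resubscribing to the Kubernetes watch API from a saved
    # resourceVersion checkpoint. Namespace and label selector are assumptions.
    from kubernetes import client, config, watch
    from kubernetes.client.rest import ApiException

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    checkpoint = None  # persist this somewhere durable across restarts

    while True:
        w = watch.Watch()
        try:
            for event in w.stream(
                v1.list_namespaced_pod,
                namespace="airflow",
                label_selector="airflow-worker",
                resource_version=checkpoint,
                timeout_seconds=60,
            ):
                pod = event["object"]
                print(event["type"], pod.metadata.name, pod.status.phase)
                # Save the checkpoint so a restart can resume from here.
                checkpoint = pod.metadata.resource_version
        except ApiException as exc:
            if exc.status == 410:
                # Checkpoint expired on the API server; re-list from scratch.
                checkpoint = None
            else:
                raise

The important bit is persisting the last resource_version somewhere durable and handling the 410 Gone the API server returns when that checkpoint has expired.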
On Thu, Aug 30, 2018, 12:54 PM Eamon Keane <eamon.kea...@gmail.com> wrote:

> Great, I must give pgbouncer a try. Testing on GKE/Cloud SQL I quickly ran
> into that limit. The next possible limit might be etcd, as pod creation is
> expensive, so if there were a lot of short-lived pods you might run into
> issues (e.g. the k8s API refusing connections), or so a Google SRE tells me.
>
> On Thu, Aug 30, 2018 at 8:21 PM Greg Neiheisel <g...@astronomer.io> wrote:
>
> > Yep, that should work fine. Pgbouncer is pretty configurable, so you can
> > play around with different settings for your environment. You can set
> > limits on the amount of connections you want to the actual database and
> > point your AIRFLOW__CORE__SQL_ALCHEMY_CONN to the pgbouncer service. In my
> > experience, you can get away with a pretty low amount of actual
> > connections to postgres. Pgbouncer has some tools to observe the count of
> > clients (airflow processes), the amount of actual connections to the
> > database, as well as the number of waiting clients. You should be able to
> > tune your max_connections to the point where you have little to no
> > clients waiting, but using a dramatically lower number of actual
> > connections to postgres.
> >
> > That chart also deploys a sidecar to pgbouncer that exports the metrics
> > for Prometheus to scrape. Here's an example Grafana dashboard that we use
> > to keep an eye on things -
> > https://github.com/astronomerio/astronomer/blob/master/docker/vendor/grafana/include/pgbouncer-stats.json
> >
> > On Thu, Aug 30, 2018 at 2:26 PM Eamon Keane <eamon.kea...@gmail.com> wrote:
> >
> > > Interesting, Greg. Do you know if using pg_bouncer would allow you to
> > > have more than 100 running k8s executor tasks at one time if e.g. there
> > > is a 100 connection limit on a gcp instance?
> > >
> > > On Thu, Aug 30, 2018 at 6:39 PM Greg Neiheisel <g...@astronomer.io> wrote:
> > >
> > > > Good point Eamon, maxing connections out is definitely something to
> > > > look out for. We recently added pgbouncer to our helm charts to pool
> > > > connections to the database for all the different airflow processes.
> > > > Here's our chart for reference -
> > > > https://github.com/astronomerio/helm.astronomer.io/tree/master/charts/airflow
> > > >
> > > > On Thu, Aug 30, 2018 at 1:17 PM Kyle Hamlin <hamlin...@gmail.com> wrote:
> > > >
> > > > > Thanks for your responses! Glad to hear that tasks can run
> > > > > independently if something happens.
> > > > >
> > > > > On Thu, Aug 30, 2018 at 1:13 PM Eamon Keane <eamon.kea...@gmail.com> wrote:
> > > > >
> > > > > > Adding to Greg's point, if you're using the k8s executor and for
> > > > > > some reason the k8s executor worker pod fails to launch within 120
> > > > > > seconds (e.g. pending due to scaling up a new node), this counts as
> > > > > > a task failure. Also, if the k8s executor pod has already launched
> > > > > > a pod operator but is killed (e.g. manually or due to a node
> > > > > > upgrade), the pod operator it launched is not killed and runs to
> > > > > > completion, so if using retries you need to ensure idempotency. The
> > > > > > worker pods update the db per my understanding, with each requiring
> > > > > > a separate connection to the db; this can tax your connection
> > > > > > budget (100-300 for small postgres instances on gcp or aws).
> > > > > >
> > > > > > On Thu, Aug 30, 2018 at 6:04 PM Greg Neiheisel <g...@astronomer.io> wrote:
> > > > > >
> > > > > > > Hey Kyle, the task pods will continue to run even if you reboot
> > > > > > > the scheduler and webserver, and the status does get updated in
> > > > > > > the airflow db, which is great.
> > > > > > >
> > > > > > > I know the scheduler subscribes to the Kubernetes watch API to
> > > > > > > get an event stream of pods completing, and it keeps a checkpoint
> > > > > > > so it can resubscribe when it comes back up.
> > > > > > >
> > > > > > > I forget if the worker pods update the db or if the scheduler is
> > > > > > > doing that, but it should work out.
> > > > > > >
> > > > > > > On Thu, Aug 30, 2018, 9:54 AM Kyle Hamlin <hamlin...@gmail.com> wrote:
> > > > > > >
> > > > > > > > gentle bump
> > > > > > > >
> > > > > > > > On Wed, Aug 22, 2018 at 5:12 PM Kyle Hamlin <hamlin...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I'm about to make the switch to Kubernetes with Airflow, but
> > > > > > > > > am wondering what happens when my CI/CD pipeline redeploys
> > > > > > > > > the webserver and scheduler and there are still long-running
> > > > > > > > > tasks (pods). My intuition is that since the database holds
> > > > > > > > > all state, the tasks are in charge of updating their own
> > > > > > > > > state, and the UI only renders what it sees in the database,
> > > > > > > > > this is not so much of a problem. To be sure, however, here
> > > > > > > > > are my questions:
> > > > > > > > >
> > > > > > > > > Will task pods continue to run?
> > > > > > > > > Can task pods continue to poll the external system they are
> > > > > > > > > running tasks on while being "headless"?
> > > > > > > > > Can the task pods change/update state in the database while
> > > > > > > > > being "headless"?
> > > > > > > > > Will the UI/Scheduler still be aware of the tasks (pods) once
> > > > > > > > > they are live again?
> > > > > > > > >
> > > > > > > > > Is there anything else that might cause issues when deploying
> > > > > > > > > while tasks (pods) are running that I'm not thinking of here?
> > > > > > > > >
> > > > > > > > > Kyle Hamlin
> > > > > > > >
> > > > > > > > --
> > > > > > > > Kyle Hamlin
> > > > >
> > > > > --
> > > > > Kyle Hamlin
> > > >
> > > > --
> > > > *Greg Neiheisel* / CTO Astronomer.io
> >
> > --
> > *Greg Neiheisel* / CTO Astronomer.io
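Greg's point above about watching client counts vs. real server connections is easy to check by hand against pgbouncer's admin console. A rough sketch, assuming psycopg2 and a pgbouncer whose service name, port, and stats user are placeholders here:

    # Peek at pgbouncer's pool stats: connected clients, waiting clients, and
    # the real server connections held against Postgres. Connection details
    # below are placeholders for illustration.
    import psycopg2

    conn = psycopg2.connect(
        host="airflow-pgbouncer",   # hypothetical service name
        port=6543,
        user="pgbouncer_stats",     # must be listed in stats_users/admin_users
        password="...",
        dbname="pgbouncer",         # the special admin/stats "database"
    )
    conn.autocommit = True  # the admin console doesn't support transactions

    with conn.cursor() as cur:
        cur.execute("SHOW POOLS")
        columns = [desc[0] for desc in cur.description]
        for row in cur.fetchall():
            stats = dict(zip(columns, row))
            # cl_active / cl_waiting: connected vs. queued airflow processes;
            # sv_active / sv_idle: actual connections held against postgres.
            print(stats)
    conn.close()

cl_waiting staying near zero while sv_active sits well below Postgres's max_connections is roughly the tuning target Greg describes, and the same numbers are what the Prometheus sidecar and Grafana dashboard he links expose.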
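And on Eamon's note about retries plus a pod operator that keeps running to completion after its executor pod is killed: a minimal sketch of one way to make such a task idempotent, keying its output on the run's logical date and writing atomically so a duplicate attempt converges on the same result. The output directory and the extract() placeholder are hypothetical.

    # Sketch of an idempotent task body: duplicate or retried attempts for the
    # same execution_date produce the same file and never interleave writes.
    import os
    import tempfile

    def extract(execution_date):
        # Placeholder for the real work (API pull, query, etc.).
        return ["row-for-%s\n" % execution_date]

    def run_extract(execution_date, output_dir="/tmp/airflow-exports"):
        out_path = os.path.join(output_dir, "extract_%s.csv" % execution_date)
        if os.path.exists(out_path):
            # A previous attempt (or a still-running duplicate pod) already finished.
            return out_path

        rows = extract(execution_date)

        # Write to a temp file and rename so concurrent attempts can't leave
        # partial output; the last rename wins with identical content.
        os.makedirs(output_dir, exist_ok=True)
        fd, tmp_path = tempfile.mkstemp(dir=output_dir)
        with os.fdopen(fd, "w") as fh:
            fh.writelines(rows)
        os.replace(tmp_path, out_path)
        return out_path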