Also worth mentioning that when you restart the scheduler, it will use etcd and Postgres to recreate state, so you won't end up re-launching or missing tasks.
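Deeper in the thread Greg mentions that the scheduler subscribes to the Kubernetes watch API and keeps a checkpoint so it can resubscribe after a restart. For anyone curious what that pattern looks like, here is a minimal sketch (not the actual scheduler code) using the official kubernetes Python client; the "airflow" namespace and the label selector are placeholders for illustration.

    # Rough sketch of resubscribing to the Kubernetes watch API from a saved
    # resourceVersion checkpoint. Namespace and label selector are assumptions.
    from kubernetes import client, config, watch
    from kubernetes.client.rest import ApiException

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster
    v1 = client.CoreV1Api()

    checkpoint = None  # persist this somewhere durable across restarts

    while True:
        w = watch.Watch()
        try:
            for event in w.stream(
                v1.list_namespaced_pod,
                namespace="airflow",
                label_selector="airflow-worker",
                resource_version=checkpoint,
                timeout_seconds=60,
            ):
                pod = event["object"]
                print(event["type"], pod.metadata.name, pod.status.phase)
                # Save the checkpoint so a restart can resume from here.
                checkpoint = pod.metadata.resource_version
        except ApiException as exc:
            if exc.status == 410:
                # Checkpoint expired on the API server; re-list from scratch.
                checkpoint = None
            else:
                raise

The important bit is persisting the last resource_version somewhere durable and handling the 410 Gone the API server returns when that checkpoint has expired.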
On Thu, Aug 30, 2018, 12:54 PM Eamon Keane <eamon.kea...@gmail.com> wrote:

> Great, I must give pgbouncer a try. Testing on GKE/Cloud SQL I quickly ran
> into that limit. The next possible limit might be etcd, as pod creation is
> expensive, so if there were a lot of short-lived pods you might run into
> issues (e.g. the k8s API refusing connections), or so a Google SRE tells me.
>
> On Thu, Aug 30, 2018 at 8:21 PM Greg Neiheisel <g...@astronomer.io> wrote:
>
> > Yep, that should work fine. Pgbouncer is pretty configurable, so you can
> > play around with different settings for your environment. You can set
> > limits on the amount of connections you want to the actual database and
> > point your AIRFLOW__CORE__SQL_ALCHEMY_CONN to the pgbouncer service. In my
> > experience, you can get away with a pretty low amount of actual
> > connections to postgres. Pgbouncer has some tools to observe the count of
> > clients (airflow processes), the amount of actual connections to the
> > database, as well as the number of waiting clients. You should be able to
> > tune your max_connections to the point where you have little to no
> > clients waiting, but using a dramatically lower number of actual
> > connections to postgres.
> >
> > That chart also deploys a sidecar to pgbouncer that exports the metrics
> > for Prometheus to scrape. Here's an example Grafana dashboard that we use
> > to keep an eye on things -
> > https://github.com/astronomerio/astronomer/blob/master/docker/vendor/grafana/include/pgbouncer-stats.json
> >
> > On Thu, Aug 30, 2018 at 2:26 PM Eamon Keane <eamon.kea...@gmail.com> wrote:
> >
> > > Interesting, Greg. Do you know if using pg_bouncer would allow you to
> > > have more than 100 running k8s executor tasks at one time if e.g. there
> > > is a 100 connection limit on a gcp instance?
> > >
> > > On Thu, Aug 30, 2018 at 6:39 PM Greg Neiheisel <g...@astronomer.io> wrote:
> > >
> > > > Good point Eamon, maxing connections out is definitely something to
> > > > look out for. We recently added pgbouncer to our helm charts to pool
> > > > connections to the database for all the different airflow processes.
> > > > Here's our chart for reference -
> > > > https://github.com/astronomerio/helm.astronomer.io/tree/master/charts/airflow
> > > >
> > > > On Thu, Aug 30, 2018 at 1:17 PM Kyle Hamlin <hamlin...@gmail.com> wrote:
> > > >
> > > > > Thanks for your responses! Glad to hear that tasks can run
> > > > > independently if something happens.
> > > > >
> > > > > On Thu, Aug 30, 2018 at 1:13 PM Eamon Keane <eamon.kea...@gmail.com> wrote:
> > > > >
> > > > > > Adding to Greg's point, if you're using the k8s executor and for
> > > > > > some reason the k8s executor worker pod fails to launch within 120
> > > > > > seconds (e.g. pending due to scaling up a new node), this counts as
> > > > > > a task failure. Also, if the k8s executor pod has already launched
> > > > > > a pod operator but is killed (e.g. manually or due to a node
> > > > > > upgrade), the pod operator it launched is not killed and runs to
> > > > > > completion, so if using retries you need to ensure idempotency. The
> > > > > > worker pods update the db per my understanding, with each requiring
> > > > > > a separate connection to the db; this can tax your connection
> > > > > > budget (100-300 for small postgres instances on gcp or aws).
> > > > > >
> > > > > > On Thu, Aug 30, 2018 at 6:04 PM Greg Neiheisel <g...@astronomer.io> wrote:
> > > > > >
> > > > > > > Hey Kyle, the task pods will continue to run even if you reboot
> > > > > > > the scheduler and webserver, and the status does get updated in
> > > > > > > the airflow db, which is great.
> > > > > > >
> > > > > > > I know the scheduler subscribes to the Kubernetes watch API to
> > > > > > > get an event stream of pods completing, and it keeps a checkpoint
> > > > > > > so it can resubscribe when it comes back up.
> > > > > > >
> > > > > > > I forget if the worker pods update the db or if the scheduler is
> > > > > > > doing that, but it should work out.
> > > > > > >
> > > > > > > On Thu, Aug 30, 2018, 9:54 AM Kyle Hamlin <hamlin...@gmail.com> wrote:
> > > > > > >
> > > > > > > > gentle bump
> > > > > > > >
> > > > > > > > On Wed, Aug 22, 2018 at 5:12 PM Kyle Hamlin <hamlin...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > I'm about to make the switch to Kubernetes with Airflow, but
> > > > > > > > > am wondering what happens when my CI/CD pipeline redeploys
> > > > > > > > > the webserver and scheduler and there are still long-running
> > > > > > > > > tasks (pods). My intuition is that since the database holds
> > > > > > > > > all state, the tasks are in charge of updating their own
> > > > > > > > > state, and the UI only renders what it sees in the database,
> > > > > > > > > this is not so much of a problem. To be sure, however, here
> > > > > > > > > are my questions:
> > > > > > > > >
> > > > > > > > > Will task pods continue to run?
> > > > > > > > > Can task pods continue to poll the external system they are
> > > > > > > > > running tasks on while being "headless"?
> > > > > > > > > Can the task pods change/update state in the database while
> > > > > > > > > being "headless"?
> > > > > > > > > Will the UI/Scheduler still be aware of the tasks (pods) once
> > > > > > > > > they are live again?
> > > > > > > > >
> > > > > > > > > Is there anything else that might cause issues when deploying
> > > > > > > > > while tasks (pods) are running that I'm not thinking of here?
> > > > > > > > >
> > > > > > > > > Kyle Hamlin
> > > > > > > >
> > > > > > > > --
> > > > > > > > Kyle Hamlin
> > > > >
> > > > > --
> > > > > Kyle Hamlin
> > > >
> > > > --
> > > > *Greg Neiheisel* / CTO Astronomer.io
> >
> > --
> > *Greg Neiheisel* / CTO Astronomer.io
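Greg's point above about watching client counts vs. real server connections is easy to check by hand against pgbouncer's admin console. A rough sketch, assuming psycopg2 and a pgbouncer whose service name, port, and stats user are placeholders here:

    # Peek at pgbouncer's pool stats: connected clients, waiting clients, and
    # the real server connections held against Postgres. Connection details
    # below are placeholders for illustration.
    import psycopg2

    conn = psycopg2.connect(
        host="airflow-pgbouncer",   # hypothetical service name
        port=6543,
        user="pgbouncer_stats",     # must be listed in stats_users/admin_users
        password="...",
        dbname="pgbouncer",         # the special admin/stats "database"
    )
    conn.autocommit = True  # the admin console doesn't support transactions

    with conn.cursor() as cur:
        cur.execute("SHOW POOLS")
        columns = [desc[0] for desc in cur.description]
        for row in cur.fetchall():
            stats = dict(zip(columns, row))
            # cl_active / cl_waiting: connected vs. queued airflow processes;
            # sv_active / sv_idle: actual connections held against postgres.
            print(stats)
    conn.close()

cl_waiting staying near zero while sv_active sits well below Postgres's max_connections is roughly the tuning target Greg describes, and the same numbers are what the Prometheus sidecar and Grafana dashboard he links expose.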
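And on Eamon's note about retries plus a pod operator that keeps running to completion after its executor pod is killed: a minimal sketch of one way to make such a task idempotent, keying its output on the run's logical date and writing atomically so a duplicate attempt converges on the same result. The output directory and the extract() placeholder are hypothetical.

    # Sketch of an idempotent task body: duplicate or retried attempts for the
    # same execution_date produce the same file and never interleave writes.
    import os
    import tempfile

    def extract(execution_date):
        # Placeholder for the real work (API pull, query, etc.).
        return ["row-for-%s\n" % execution_date]

    def run_extract(execution_date, output_dir="/tmp/airflow-exports"):
        out_path = os.path.join(output_dir, "extract_%s.csv" % execution_date)
        if os.path.exists(out_path):
            # A previous attempt (or a still-running duplicate pod) already finished.
            return out_path

        rows = extract(execution_date)

        # Write to a temp file and rename so concurrent attempts can't leave
        # partial output; the last rename wins with identical content.
        os.makedirs(output_dir, exist_ok=True)
        fd, tmp_path = tempfile.mkstemp(dir=output_dir)
        with os.fdopen(fd, "w") as fh:
            fh.writelines(rows)
        os.replace(tmp_path, out_path)
        return out_path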