Great, I must give pgbouncer a try. Testing on GKE/Cloud SQL I quickly ran into that limit. The next possible limit might be etcd: pod creation is expensive, so with a lot of short-lived pods you might run into issues (e.g. the k8s API refusing connections), or so a Google SRE tells me.
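For what it's worth, here's roughly what I'm picturing for pointing Airflow at a pgbouncer service. Just a sketch; the service name, port and credentials are placeholders rather than anything taken from the chart linked below:

# Rough sketch: route Airflow's metadata DB traffic through a pgbouncer
# Service instead of hitting Cloud SQL directly. Host "airflow-pgbouncer",
# port 6432 and the credentials are placeholders.
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# This URI is what you'd set as AIRFLOW__CORE__SQL_ALCHEMY_CONN on the
# scheduler/webserver/worker pods.
conn_uri = "postgresql+psycopg2://airflow:airflow@airflow-pgbouncer:6432/airflow"

# Quick connectivity check. NullPool is just the assumption that pgbouncer
# does the pooling, so we skip a client-side pool on top of it.
engine = create_engine(conn_uri, poolclass=NullPool)
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())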
On Thu, Aug 30, 2018 at 8:21 PM Greg Neiheisel <[email protected]> wrote:

> Yep, that should work fine. Pgbouncer is pretty configurable, so you can
> play around with different settings for your environment. You can set
> limits on the number of connections you want to the actual database and
> point your AIRFLOW__CORE__SQL_ALCHEMY_CONN to the pgbouncer service. In my
> experience, you can get away with a pretty low number of actual connections
> to postgres. Pgbouncer has some tools to observe the count of clients
> (airflow processes), the number of actual connections to the database, as
> well as the number of waiting clients. You should be able to tune your
> max_connections to the point where you have little to no clients waiting,
> while using a dramatically lower number of actual connections to postgres.
>
> That chart also deploys a sidecar to pgbouncer that exports the metrics for
> Prometheus to scrape. Here's an example Grafana dashboard that we use to
> keep an eye on things -
> https://github.com/astronomerio/astronomer/blob/master/docker/vendor/grafana/include/pgbouncer-stats.json
>
> On Thu, Aug 30, 2018 at 2:26 PM Eamon Keane <[email protected]> wrote:
>
> > Interesting, Greg. Do you know if using pg_bouncer would allow you to
> > have more than 100 running k8s executor tasks at one time if e.g. there
> > is a 100 connection limit on a gcp instance?
> >
> > On Thu, Aug 30, 2018 at 6:39 PM Greg Neiheisel <[email protected]> wrote:
> >
> > > Good point Eamon, maxing connections out is definitely something to
> > > look out for. We recently added pgbouncer to our helm charts to pool
> > > connections to the database for all the different airflow processes.
> > > Here's our chart for reference -
> > > https://github.com/astronomerio/helm.astronomer.io/tree/master/charts/airflow
> > >
> > > On Thu, Aug 30, 2018 at 1:17 PM Kyle Hamlin <[email protected]> wrote:
> > >
> > > > Thanks for your responses! Glad to hear that tasks can run
> > > > independently if something happens.
> > > >
> > > > On Thu, Aug 30, 2018 at 1:13 PM Eamon Keane <[email protected]> wrote:
> > > >
> > > > > Adding to Greg's point, if you're using the k8s executor and for
> > > > > some reason the k8s executor worker pod fails to launch within 120
> > > > > seconds (e.g. pending due to scaling up a new node), this counts as
> > > > > a task failure. Also, if the k8s executor pod has already launched
> > > > > a pod operator but is killed (e.g. manually or due to a node
> > > > > upgrade), the pod operator it launched is not killed and runs to
> > > > > completion, so if using retries, you need to ensure idempotency.
> > > > > The worker pods update the db per my understanding, with each
> > > > > requiring a separate connection to the db; this can tax your
> > > > > connection budget (100-300 for small postgres instances on gcp or
> > > > > aws).
> > > > >
> > > > > On Thu, Aug 30, 2018 at 6:04 PM Greg Neiheisel <[email protected]> wrote:
> > > > >
> > > > > > Hey Kyle, the task pods will continue to run even if you reboot
> > > > > > the scheduler and webserver, and the status does get updated in
> > > > > > the airflow db, which is great.
> > > > > >
> > > > > > I know the scheduler subscribes to the Kubernetes watch API to
> > > > > > get an event stream of pods completing, and it keeps a checkpoint
> > > > > > so it can resubscribe when it comes back up.
> > > > > >
> > > > > > I forget if the worker pods update the db or if the scheduler is
> > > > > > doing that, but it should work out.
> > > > > >
> > > > > > On Thu, Aug 30, 2018, 9:54 AM Kyle Hamlin <[email protected]> wrote:
> > > > > >
> > > > > > > gentle bump
> > > > > > >
> > > > > > > On Wed, Aug 22, 2018 at 5:12 PM Kyle Hamlin <[email protected]> wrote:
> > > > > > >
> > > > > > > > I'm about to make the switch to Kubernetes with Airflow, but
> > > > > > > > am wondering what happens when my CI/CD pipeline redeploys the
> > > > > > > > webserver and scheduler and there are still long-running tasks
> > > > > > > > (pods). My intuition is that since the database holds all
> > > > > > > > state, the tasks are in charge of updating their own state,
> > > > > > > > and the UI only renders what it sees in the database, this is
> > > > > > > > not so much of a problem. To be sure, however, here are my
> > > > > > > > questions:
> > > > > > > >
> > > > > > > > Will task pods continue to run?
> > > > > > > > Can task pods continue to poll the external system they are
> > > > > > > > running tasks on while being "headless"?
> > > > > > > > Can the task pods change/update state in the database while
> > > > > > > > being "headless"?
> > > > > > > > Will the UI/Scheduler still be aware of the tasks (pods) once
> > > > > > > > they are live again?
> > > > > > > >
> > > > > > > > Is there anything else that might cause issues when deploying
> > > > > > > > while tasks (pods) are running that I'm not thinking of here?
> > > > > > > >
> > > > > > > > Kyle Hamlin
> > > > > > >
> > > > > > > --
> > > > > > > Kyle Hamlin
> > > >
> > > > --
> > > > Kyle Hamlin
> > >
> > > --
> > > *Greg Neiheisel* / CTO Astronomer.io
>
> --
> *Greg Neiheisel* / CTO Astronomer.io
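PS for anyone landing on this thread later: a rough sketch of watching the numbers Greg describes (connected clients, actual server connections, waiting clients) by polling pgbouncer's admin console with psycopg2. Host, port and credentials are placeholders for your own deployment:

import psycopg2

conn = psycopg2.connect(
    host="airflow-pgbouncer",  # placeholder service name
    port=6432,                 # pgbouncer's default listen port
    dbname="pgbouncer",        # the virtual admin database
    user="pgbouncer_stats",    # must be listed in stats_users or admin_users
    password="change-me",
)
conn.autocommit = True  # the admin console rejects transactions

with conn.cursor() as cur:
    cur.execute("SHOW POOLS")
    cols = [d[0] for d in cur.description]
    for row in cur.fetchall():
        pool = dict(zip(cols, row))
        # cl_waiting staying at 0 while sv_active stays well below the
        # database's max_connections is roughly the state you tune for.
        print(pool["database"], pool["user"],
              "clients:", pool["cl_active"],
              "waiting:", pool["cl_waiting"],
              "server conns:", pool["sv_active"])
conn.close()

If you graph those columns (presumably the Prometheus sidecar mentioned above exposes similar data), you can lower the pool size until cl_waiting starts creeping up and settle just above that point.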
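And on the watch/checkpoint behaviour Greg mentions: a simplified illustration of the resubscribe pattern with the official kubernetes Python client. The namespace and label selector are made up, and Airflow's own watcher does more than this:

from kubernetes import client, config, watch

# Placeholder setup; inside the scheduler pod you'd use load_incluster_config().
config.load_kube_config()
v1 = client.CoreV1Api()

last_resource_version = "0"  # in Airflow this checkpoint survives restarts

w = watch.Watch()
for event in w.stream(
    v1.list_namespaced_pod,
    namespace="airflow",                     # placeholder namespace
    label_selector="airflow-worker",         # placeholder label
    resource_version=last_resource_version,  # resume from the checkpoint
):
    pod = event["object"]
    last_resource_version = pod.metadata.resource_version
    if pod.status.phase in ("Succeeded", "Failed"):
        print(event["type"], pod.metadata.name, pod.status.phase)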
