On Tue, Jan 7, 2020 at 4:26 PM Maciej Jaros <[email protected]> wrote:
>
> The problem is I didn't shut it down. So what did?

Routine maintenance on the Grid Engine nodes. The shutdown timestamps
line up with this SAL entry [0]: "22:58 <bd808> Depooling
tools-sgewebgrid-lighttpd-090[2-9]".

The depooling process is intended to restart running webservice
workloads on new nodes in the cluster, but apparently in this case it
did not. Sadly this is not horribly surprising. Grid Engine is not very
good at tracking system state compared to the Kubernetes cluster in
Toolforge.

If your tool is capable of running on our Kubernetes system (uses one
language runtime and does not rely on special software installed
globally) then migrating from Grid Engine to Kubernetes will almost
certainly leave you with a more stable webservice. See the Wikitech
page on the last Grid Engine migration [1] for some hints on how to
migrate.

[0]: https://tools.wmflabs.org/sal/log/AW99FPYQfYQT6VcDfz3h
[1]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Move_a_grid_engine_webservice

Bryan
-- 
Bryan Davis
Technical Engagement, Wikimedia Foundation
Principal Software Engineer
Boise, ID USA
[[m:User:BDavis_(WMF)]]
irc: bd808
