On Tue, Jan 7, 2020 at 4:26 PM Maciej Jaros <[email protected]> wrote:
>
> The problem is I didn't shut it down. So what did?

Routine maintenance on the Grid Engine nodes. The shutdown timestamps
line up with this SAL entry [0]: "22:58 <bd808> Depooling
tools-sgewebgrid-lighttpd-090[2-9]".

The depooling process is intended to restart running webservice
workloads on new nodes in the cluster, but apparently in this case it
did not. Sadly, this is not very surprising: Grid Engine is not as
good at tracking system state as the Kubernetes cluster in Toolforge.

If your tool is capable of running on our Kubernetes system (that is,
it uses a single language runtime and does not rely on special software
installed globally), then migrating from Grid Engine to Kubernetes will
almost certainly leave you with a more stable webservice. See the
Wikitech page on the last Grid Engine migration [1] for some hints on
how to migrate.

[0]: https://tools.wmflabs.org/sal/log/AW99FPYQfYQT6VcDfz3h
[1]: https://wikitech.wikimedia.org/wiki/News/Toolforge_Trusty_deprecation#Move_a_grid_engine_webservice

Bryan
-- 
Bryan Davis              Technical Engagement      Wikimedia Foundation
Principal Software Engineer                               Boise, ID USA
[[m:User:BDavis_(WMF)]]                                      irc: bd808

_______________________________________________
Wikimedia Cloud Services mailing list
[email protected] (formerly [email protected])
https://lists.wikimedia.org/mailman/listinfo/cloud
