Hi all,

We've managed to find something of a workaround for this problem. By having 
only a single hub provide placeholders, and telling all hubs to use the 
userScheduler from that special hub, we've avoided the OutOfmemory pods for 
the last few days. (For those of you playing along at home: set 
c.KubeSpawner.scheduler_name = '<special hub namespace>-user-scheduler' in the 
JupyterHub config file of every hub other than the special one. Or set it on 
all of them; in the special hub it already defaults to that value.)
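
For concreteness, the relevant line of jupyterhub_config.py looks something 
like this (using the hypothetical alpha and beta hubs from my earlier message, 
with alpha as the special hub that provides the placeholders):

    # Set in the config of every hub *except* the special one (alpha here),
    # where it's already the default.  The name follows the
    # '<namespace>-user-scheduler' pattern above; substitute your own
    # special hub's namespace.
    c.KubeSpawner.scheduler_name = 'alpha-user-scheduler'
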
I don't particularly like having one hub that's special in some way, so I'm 
still open to other ideas on how to fix this. Perhaps it's possible to deploy 
the userScheduler in its own namespace, independent of any hub?

I appreciate all hints and guesses you have to offer,
Robert

On Dec 3 2020, at 10:41 am, Robert Schroll <[email protected]> wrote:
> Hi all,
>
> We've run into a problem with a JupyterHub-on-Kubernetes setup (based on 
> Z2JH), where user pods sometimes end up in an OutOfmemory state on startup. 
> This seems to happen only with multiple hubs and user placeholders. Has 
> anyone else run into this issue?
> Our setup is a single Kubernetes cluster. Nodes are sized to support a 
> single user pod at a time, and we're relying on autoscaling to set the 
> cluster size appropriately. We typically run several hubs on this cluster, 
> and we've recently been experimenting with user placeholders to speed up 
> start times.
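> To make the sizing concrete, in KubeSpawner terms it's roughly the following 
> (the numbers here are made up for illustration, not our actual values):
>
>     # Each user pod requests (nearly) all of a node's allocatable memory,
>     # so only one user pod fits on a node at a time.
>     c.KubeSpawner.mem_guarantee = '13G'  # hypothetical; just under the
>     c.KubeSpawner.mem_limit = '13G'      # node's allocatable memory
>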
> As best I can tell, the issue occurs as follows: Let's suppose we're running 
> two hubs, alpha and beta. Alpha has two placeholders running, ph-1 and ph-2. 
> Since each takes up a full node, they are running on node-a and node-b. Now, 
> on the beta hub, user-beta-1 starts their server. There is no extra space, so 
> ph-1 gets evicted, and user-beta-1 is assigned to node-a. This all works 
> fine, even with ph-1 and user-beta-1 being from different namespaces. The 
> cluster notices the unschedulable pod (ph-1) and starts scaling up. Before 
> the new node is ready, another user from the beta hub, user-beta-2, starts 
> their server. Again, there is no extra space, so ph-2 is evicted. Now, 
> however, both user-beta-2 and ph-1 (which was waiting for space to open up) 
> get assigned to node-b (where ph-2 just left). ph-1 starts up more quickly 
> and reserves the node's resources, so when user-beta-2 starts, it finds 
> insufficient resources. (In our setup, memory is the critical limit.) 
> user-beta-2 enters an OutOfmemory state, where it sulks until I come around 
> and delete the pod. (Even if we remove other pods from the node, it never 
> recovers.)
> One worry is that there is some mismatch in priorities between the different 
> namespaces. But I don't think that is the (entire) issue -- placeholders from 
> one namespace are evicted to make room for pods from another. I think it has 
> more to do with assignment of waiting pods to (newly empty) nodes. As I 
> understand it, this is done by the userScheduler, and there's one per hub 
> namespace. Perhaps the two schedulers are making inconsistent decisions?
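> If it would help to check that, the priority classes are easy to dump -- 
> something like this quick sketch with the kubernetes Python client (priority 
> classes are cluster-scoped, so any per-hub differences should show up here):
>
>     # List every priority class the schedulers see, with its value.
>     # Purely a diagnostic sketch; assumes local kubectl credentials.
>     from kubernetes import client, config
>
>     config.load_kube_config()
>     for pc in client.SchedulingV1Api().list_priority_class().items:
>         print(pc.metadata.name, pc.value, pc.global_default)
>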
> One solution would be to give each hub its own node pool, so that pods could 
> only evict placeholders from the same hub. I'd like to avoid that if 
> possible. Our hubs have different usage patterns, so it's nice to have one 
> large pool of placeholders that can serve whichever hub is seeing the most 
> use at any given time.
> I wonder if it's possible to run a single userScheduler for all the hubs. 
> Perhaps this would force it to consider user pods from all namespaces when 
> making decisions. But I don't know how to go about doing this offhand.
> Another solution would be to find a way to restart user pods that get into 
> the OutOfmemory state. If I delete the pod by hand and then restart it from 
> the web interface (which itself requires a restart to notice that the user 
> pod has gone away), it will come up just fine, even kicking out the 
> placeholder that beat it before. Running a cron job that could do this every 
> minute would be a fine stop-gap solution. But again, I'm a bit out of my 
> depth here.
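> To make that concrete, the cron job could run something like the following 
> sketch (untested -- it assumes the failed pods' status reason matches the 
> 'OutOfmemory' string that kubectl shows, and that the job has credentials 
> allowed to delete pods):
>
>     # Delete any pod stuck in the OutOfmemory state so it can be respawned.
>     from kubernetes import client, config
>
>     config.load_incluster_config()  # or load_kube_config() off-cluster
>     v1 = client.CoreV1Api()
>     failed = v1.list_pod_for_all_namespaces(
>         field_selector='status.phase=Failed')
>     for pod in failed.items:
>         if pod.status.reason == 'OutOfmemory':
>             v1.delete_namespaced_pod(pod.metadata.name,
>                                      pod.metadata.namespace)
>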
> Any ideas or suggestions? I'll readily admit that there's a 50/50 chance I've 
> misdiagnosed at least a part of the problem, so I'm happy to run any 
> additional diagnostics that might clear things up.
> Thanks,
> Robert
