Hello all: I've got some problems with my users and RAM usage. One user in particular is doing algorithm design and knows his code is buggy, but he has to debug it on the cluster. He's therefore trying to set memory limits so his jobs don't crash nodes: roughly 50% of his jobs crash their first node, which leaves about half the cluster "locked" until I catch it and restart the crashed node. However, we've had a lot of trouble getting memory limits to work correctly.
What we need is a way to give the first node/process one limit and the remaining processes a different limit (e.g., the first process gets 2GB, the rest get 1GB). If we just assign 2GB across the whole job, half the cluster goes unused, since we have 8GB of RAM and 8 cores per node.

We've also had trouble enforcing memory limits correctly at all. We want per-process limits, not job-total limits, and when we tried the various mem= options, the results differed from what we expected based on reading the appropriate man page. Any suggestions? This is currently the single largest problem facing our ROCKS cluster, and it has posed a significant reliability problem.

Thanks!

--Jim
Admin of "aeolus", 24 8-core, 8GB nodes.
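P.S. In case it clarifies what we're after, below is a sketch of the per-process workaround we've been considering in the meantime: a wrapper that sets a kernel-enforced address-space limit on each MPI rank before exec'ing the real program, with rank 0 getting 2GB and everyone else 1GB. The script name (limit_wrap.py), the rank environment variables (OMPI_COMM_WORLD_RANK for Open MPI, PMI_RANK for MPICH/Hydra), and the 2GB/1GB split are just illustrative assumptions for our setup; we'd still much rather have the batch system enforce this for us.

#!/usr/bin/env python
# limit_wrap.py -- hypothetical per-rank memory-limit wrapper (sketch only).
# Usage:  mpirun -np 16 ./limit_wrap.py ./buggy_solver <args>
import os
import sys
import resource

GB = 1024 ** 3

def mpi_rank():
    # Which variable the launcher exports depends on the MPI stack:
    # OMPI_COMM_WORLD_RANK for Open MPI, PMI_RANK for MPICH/Hydra.
    for var in ("OMPI_COMM_WORLD_RANK", "PMI_RANK"):
        if var in os.environ:
            return int(os.environ[var])
    return 0  # unknown launcher: treat as rank 0

def main():
    limit = 2 * GB if mpi_rank() == 0 else 1 * GB
    # RLIMIT_AS caps the process's virtual address space; the kernel
    # enforces it per process, so a runaway rank fails its allocations
    # instead of driving the whole node into swap/OOM.
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
    # Replace this wrapper with the real job binary.
    os.execvp(sys.argv[1], sys.argv[1:])

if __name__ == "__main__":
    main()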
