On Wed, 2007-03-28 at 15:19 -0700, Jay Srinivasan wrote:
> Garrick Staples wrote:
> > On Wed, Mar 28, 2007 at 12:16:16AM -0700, Jay Srinivasan alleged:
> >> Hi,
> >>
> >> In moab/MRes.c in the MNodeUpdateResExpression() routine (around line
> >> 4075 in Maui-3.2.6p19), the check for MaxTasks and TaskCount, which is
> >>
> >>   if ((R->MaxTasks > 0) && (R->TaskCount >= R->MaxTasks)) continue;
> >>
> >> I think, will check to see if the task count for the SR is more than the
> >> SRMAXTASKS parameter and then continue to the next SR and not update the
> >> current SR with the node(s) in the RegExp under consideration.
> >>
> >> But, in Maui at least, it does not seem that the SRMAXTASKS parameter is
> >> even honored (nor do setres or MResCreate() even take it as a
> >> parameter), and so it seems that MaxTasks is always zero in this case
> >> for SRs.
> >>
> >> Thus, every time a pbs_mom is recycled, this routine ends up adding the
> >> node that just came up to the SR nodelist, whether the node was on the
> >> list originally or not. This results in the SR gradually growing in size.
> >>
> >> I think the fix for this is to simply check for a possible MaxTasks
> >> value of 0 as well, i.e.
> >>
> >>   if ((R->MaxTasks >= 0) && (R->TaskCount >= R->MaxTasks)) continue;
> >>
> >> Could someone who has a better knowledge of Maui internals please
> >> confirm that this is the case or let me know if I am not correct?
> >
> > I can't comment directly on the problem, but I can say that Maui doesn't
> > talk to pbs_mom and I can't think of any reason why restarting pbs_mom
> > could affect Maui.
> >
>
> Yes, perhaps not directly. But Maui has to know how many MOMs are
> running and coordinate the node->SR mapping. So, when Maui does its
> periodic scan and figures out that a node which was down has become
> available again (either through Torque or PBSPro -- I have the problem
> under both), it goes through the MNodeUpdateResExpression() code path
> and tosses that node onto the SR nodelist always (whether or not the
> node was on the SR nodelist to begin with).

Yes, I agree; we have seen this behaviour too. That's one reason we
stopped using SRs.

-- 
Ake Sandgren, HPC2N, Umea University, S-90187 Umea, Sweden
Internet: [EMAIL PROTECTED]   Phone: +46 90 7866134   Fax: +46 90 7866126
Mobile: +46 70 7716134   WWW: http://www.hpc2n.umu.se
_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
