Hi.

For whatever it is worth, Maui has some serious bugs when it is under full load.

I had Maui running for a VERY long time, and it behaved differently when
it was mostly idle than when it was under heavy use; we have thousands of
cores.

In my frustration I downloaded and enabled a Moab evaluation, and as if
by magic all of the weirdness we were seeing in Maui went away during
those 2 months. When I reverted to Maui after the two months, all of the
same weirdness came back.

We eventually dropped Maui and went with Son of Grid Engine, as Moab was
prohibitively expensive for us. Grid Engine has been working very well,
albeit with several home-grown custom modifications.

Joseph



On 12/09/2015 12:37 PM, Michel Béland wrote:
> Bas van der Vlies wrote:
>
>> Dear Michel,
>>
>> Here is what I read from the code (it was a while back that I did that,
>> but we are in the process of patching some Maui code):
>>     * N->TaskCount is the number of jobs on the node,
>>     * and N->DRes.Procs is the resources dedicated to the consumers (jobs).
>>
>> So you can have a node with 16 cores, and if it is a shared node:
>>    * 4 nodes that consume all cores
> You mean 4 jobs, I guess?
>
>> With N->DRes.Procs you can determine whether there are slots available
>> for other jobs, e.g.:
>>    * 5 jobs, each using 2 cores
>>    * N->DRes.Procs will be 10
>>    * and there are still 6 slots available
>>
>> That is what I read.
> Hmm... Looking at the code, I see, for example, that the functions
> MPBSNodeLoad() and MPBSNodeUpdate() in file MPBSI.c both loop over the
> comma-separated tokens of a node's "jobs" attribute, incrementing
> N->TaskCount for each token. This is a while loop using
>
> ptr = MUStrTok(tmpBuffer,", \t",&TokPtr);
>
> to get the first token and
>
> ptr = MUStrTok(NULL,", \t",&TokPtr);
>
> to get the next one, pretty similar to the C function strtok() and the
> POSIX function strtok_r().
>
> This means that with a jobs attribute looking like this:
>
> 0/48.server, 1/48.server, 2/49.server, 3/49.server
>
> N->TaskCount will end up taking the value 4, even though only two jobs
> (48.server and 49.server) are on the node. In other words, it counts
> processors, not jobs.
>
> Lines 3188 to 3209 of MPBSI.c are where MPBSNodeUpdate() handles one
> token by incrementing N->TaskCount and extracting the JobID. It is after
> that that things get interesting. N->DRes.Procs is incremented this way:
>
> 3211        if (MJobFind(JobID,&J,0) == SUCCESS)
> 3212           {
> 3213           if (J->Req[0]->DRes.Procs == -1)
> 3214             {
> 3215             tmpProcs = N->CRes.Procs;
> 3216             }
> 3217           else
> 3218             {
> 3219             tmpProcs = MAX(1,J->Req[0]->DRes.Procs);
> 3220             }
> 3221
> 3222           N->DRes.Procs = MIN(N->DRes.Procs + tmpProcs,N->CRes.Procs);
> 3223
>
> If I understand correctly, J->Req[0]->DRes.Procs is the number of
> processors dedicated to the job. But wait, what tells us that these
> processors are on the current node? The only way I think this code might
> work is if J->Req[0]->DRes.Procs == 0, making tmpProcs equal to 1. Then
> N->DRes.Procs and N->TaskCount are the same...
>
> MPBSNodeLoad() has similar-looking code, but not exactly the same:
>
> 2514         if (MJobFind(JobID,&J,0) == SUCCESS)
> 2515           {
> 2516           N->DRes.Procs += MAX(1,J->Req[0]->DRes.Procs);  /* FIXME */
>
> I will experiment with a debugger on an old cluster running an earlier
> version of Torque and see what value J->Req[0]->DRes.Procs has.
>
>
>
>>> On 8 dec. 2015, at 23:09, Michel Béland <michel.bel...@calculquebec.ca> wrote:
>>>
>>> Hello,
>>>
>>> I am trying to modify Maui so that it correctly understands the
>>> "exec_host" job attribute and the "jobs" node attribute. While reading
>>> the code, I wondered what some parts of the data structure mean.
>>>
>>> So what is the difference between N->TaskCount and N->DRes.Procs, where
>>> N is a pointer of type mnode_t? The comments do not help a lot.
>>>
>>> -- 
>>> Michel Béland, scientific computing analyst
>>> michel.bel...@calculquebec.ca
>>> office S-250, Roger-Gaudry building (main), Université de Montréal
>>> telephone: 514 343-6111 ext. 3892     fax: 514 343-2155
>>> Calcul Québec (www.calculquebec.ca)
>>> Calcul Canada (calculcanada.ca)
>>>
>>> _______________________________________________
>>> mauiusers mailing list
>>> mauiusers@supercluster.org
>>> http://www.supercluster.org/mailman/listinfo/mauiusers
>> --
>> Bas van der Vlies
>> | Operations, Support & Development | SURFsara | Science Park 140 | 1098 XG Amsterdam
>> | T +31 (0) 20 800 1300  | bas.vandervl...@surfsara.nl | www.surfsara.nl |
>>
>>
>>
>>
>
