I think I found the bug. When we had more than 20,000 jobs, Maui
crashed with a segv because MJob was no longer circular: the last
element did not point to the first entry in the list.
We got a lot of these errors in maui.log:
11/01 21:12:13 ERROR: job hash table is FULL. cannot add MJob
[21208] '43806'
11/01 21:12:13 ERROR: job buffer is full (ignoring job
'43806.testm.irc.sara.nl')
11/01 21:12:13 ERROR: job hash table is FULL. cannot add MJob
[21218] '43817'
11/01 21:12:13 ERROR: job buffer is full (ignoring job
'43817.testm.irc.sara.nl')
11/01 21:12:13 ERROR: job hash table is FULL. cannot add MJob
[21228] '43828'
11/01 21:12:13 ERROR: job buffer is full (ignoring job
'43828.testm.irc.sara.nl')
11/01 21:12:14 ERROR: job hash table is FULL. cannot add MJob
[21238] '43839'
11/01 21:12:14 ERROR: job buffer is full (ignoring job
'43839.testm.irc.sara.nl')
11/01 21:12:15 ERROR: job hash table is FULL. cannot add MJob
[21304] '43906'
11/01 21:12:15 ERROR: job buffer is full (ignoring job
'43906.testm.irc.sara.nl')
11/01 21:12:15 ERROR: job hash table is FULL. cannot add MJob
[21314] '43917'
11/01 21:12:15 ERROR: job buffer is full (ignoring job
'43917.testm.irc.sara.nl')
11/01 21:12:16 ERROR: job hash table is FULL. cannot add MJob
[21324] '43928'
11/01 21:12:16 ERROR: job buffer is full (ignoring job
'43928.testm.irc.sara.nl')
11/01 21:12:16 ERROR: job hash table is FULL. cannot add MJob
[21334] '43939'
11/01 21:12:16 ERROR: job buffer is full (ignoring job
'43939.testm.irc.sara.nl')
11/01 21:12:19 ERROR: job hash table is FULL. cannot add MJob
[21464] '44070'
11/01 21:12:19 ERROR: job buffer is full (ignoring job
'44070.testm.irc.sara.nl')
11/01 21:12:19 ERROR: job hash table is FULL. cannot add MJob
[21474] '44081'
It goes wrong in MJob.c. It computes a hashkey for the job, and this
hashkey is used as the starting value of a loop. The end value of the
loop is MMAX_JOB + MAX_MHBUF. When all entries in this range are
occupied, it displays the error above, even if slots below the
hashkey are still free.
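
To illustrate the lookup described above, here is a minimal sketch. It
is not the Maui source: TABLE_SIZE, the used[] array, and
find_slot_forward_only() are hypothetical names, and TABLE_SIZE only
stands in for MMAX_JOB + MAX_MHBUF. It shows how a search that starts
at the hashkey and never looks below it can report "full" while lower
slots are free.

```c
#define TABLE_SIZE 8   /* hypothetical; stands in for MMAX_JOB + MAX_MHBUF */

/* Hypothetical slot table: used[i] == 0 means slot i is free.
 * Probe forward from hashkey to the end of the table only.
 * Returns the index of the first free slot in [hashkey, TABLE_SIZE),
 * or -1 if there is none. Slots below hashkey are never examined. */
static int find_slot_forward_only(const int *used, int hashkey)
{
    int index;

    for (index = hashkey; index < TABLE_SIZE; index++) {
        if (!used[index])
            return index;
    }

    /* The caller would log "table is FULL" here,
     * even if slots 0..hashkey-1 are free. */
    return -1;
}
```

With slots 5, 6 and 7 occupied and everything else free, a job whose
hashkey is 5 gets -1 back, although slots 0 to 4 are available.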
The patch that I have written will slow down the server, because it
searches from 0 to the end for free slots.
I am trying to optimize it. With the patch, Maui now runs jobs and
does not segv at the start.
/* HvB bas
DBG(1,fSTRUCT) DPrint("ERROR: MSched.M[mxoJob] = %d\n", MSched.M[mxoJob]);
for (index = hashkey;index < MSched.M[mxoJob] + MAX_MHBUF;index++)
*/
for (index = 0;index < MSched.M[mxoJob] + MAX_MHBUF;index++)
I have written something that first tries the computed hashkey and
then falls back to the brute-force method. I am now testing the patch.
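
As a rough sketch of that idea (not the actual patch): the names
TABLE_SIZE, used[] and find_slot_wrapping() are hypothetical, and I am
assuming that "tries the computed hashkey" means scanning from the
hashkey to the end first. The fallback here scans only 0 to
hashkey - 1, which covers the same slots as a full scan from 0 because
the upper part has already been checked.

```c
#define TABLE_SIZE 8   /* hypothetical; stands in for MSched.M[mxoJob] + MAX_MHBUF */

/* Hypothetical slot table: used[i] == 0 means slot i is free.
 * Returns the index of a free slot, or -1 only when every slot
 * in the table is occupied. */
static int find_slot_wrapping(const int *used, int hashkey)
{
    int index;

    /* Phase 1: the fast path, starting at the computed hashkey. */
    for (index = hashkey; index < TABLE_SIZE; index++) {
        if (!used[index])
            return index;
    }

    /* Phase 2: fallback over the slots that phase 1 skipped. */
    for (index = 0; index < hashkey; index++) {
        if (!used[index])
            return index;
    }

    return -1;  /* the table really is full */
}
```

The fast path keeps the common case cheap, and the full scan is only
paid when the upper part of the table is exhausted.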
Regards
--
Bas van der Vlies