I think I found the bug. When we have more than 20,000 jobs, maui crashes with a segv because MJob is no longer circular: the last element does not point to the first entry in the list.

We got a lot of these errors in maui.log:
11/01 21:12:13 ERROR: job hash table is FULL. cannot add MJob [21208] '43806'
11/01 21:12:13 ERROR: job buffer is full (ignoring job '43806.testm.irc.sara.nl')
11/01 21:12:13 ERROR: job hash table is FULL. cannot add MJob [21218] '43817'
11/01 21:12:13 ERROR: job buffer is full (ignoring job '43817.testm.irc.sara.nl')
11/01 21:12:13 ERROR: job hash table is FULL. cannot add MJob [21228] '43828'
11/01 21:12:13 ERROR: job buffer is full (ignoring job '43828.testm.irc.sara.nl')
11/01 21:12:14 ERROR: job hash table is FULL. cannot add MJob [21238] '43839'
11/01 21:12:14 ERROR: job buffer is full (ignoring job '43839.testm.irc.sara.nl')
11/01 21:12:15 ERROR: job hash table is FULL. cannot add MJob [21304] '43906'
11/01 21:12:15 ERROR: job buffer is full (ignoring job '43906.testm.irc.sara.nl')
11/01 21:12:15 ERROR: job hash table is FULL. cannot add MJob [21314] '43917'
11/01 21:12:15 ERROR: job buffer is full (ignoring job '43917.testm.irc.sara.nl')
11/01 21:12:16 ERROR: job hash table is FULL. cannot add MJob [21324] '43928'
11/01 21:12:16 ERROR: job buffer is full (ignoring job '43928.testm.irc.sara.nl')
11/01 21:12:16 ERROR: job hash table is FULL. cannot add MJob [21334] '43939'
11/01 21:12:16 ERROR: job buffer is full (ignoring job '43939.testm.irc.sara.nl')
11/01 21:12:19 ERROR: job hash table is FULL. cannot add MJob [21464] '44070'
11/01 21:12:19 ERROR: job buffer is full (ignoring job '44070.testm.irc.sara.nl')
11/01 21:12:19 ERROR: job hash table is FULL. cannot add MJob [21474] '44081'

It goes wrong in MJob.c. A hash key is computed for the job and used as the starting value of a loop whose end value is MMAX_JOB + MAX_MHBUF. When all entries in this range are occupied, it displays the error above, even though slots below the hash key may still be free. The patch that I have written will slow down the server, because it searches from 0 to the end for free slots.

I am trying to optimize it. With the patch, maui now runs jobs and no longer segfaults at startup.


  /* HvB bas
DBG(1,fSTRUCT) DPrint("ERROR: MSched.M[mxoJob] = %d\n", MSched.M [mxoJob]);
  for (index = hashkey;index < MSched.M[mxoJob] + MAX_MHBUF;index++)
  */
  for (index = 0;index < MSched.M[mxoJob] + MAX_MHBUF;index++)


I have written something that first tries the computed hash key and then falls back to the brute-force method. I am now testing the patch.
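
To illustrate the idea, here is a minimal standalone sketch, not the actual maui code. The names slot_table, TABLE_SIZE, find_slot_forward_only and find_slot_two_pass are made up for this example, and TABLE_SIZE only stands in for MSched.M[mxoJob] + MAX_MHBUF. The second pass scans 0 up to the hash key, which finds the same slot as a full scan from 0 because the range from the hash key to the end has already been checked.

```c
#include <assert.h>

/* Hypothetical table size; stands in for MSched.M[mxoJob] + MAX_MHBUF. */
#define TABLE_SIZE 8

/* Hypothetical slot table: 0 means free, non-zero means occupied. */
typedef struct {
    int used[TABLE_SIZE];
} slot_table;

/* Behaviour as described above: probe only from hashkey to the end.
 * Returns the free index, or -1 when that range is fully occupied
 * (the case reported as "job hash table is FULL"). */
static int find_slot_forward_only(const slot_table *t, int hashkey)
{
    assert(hashkey >= 0 && hashkey < TABLE_SIZE);

    for (int index = hashkey; index < TABLE_SIZE; index++) {
        if (!t->used[index])
            return index;
    }
    return -1;
}

/* Two-pass search: try the cheap probe from hashkey first, then fall
 * back to scanning the part of the table below hashkey. */
static int find_slot_two_pass(const slot_table *t, int hashkey)
{
    int index = find_slot_forward_only(t, hashkey);
    if (index >= 0)
        return index;

    for (index = 0; index < hashkey; index++) {
        if (!t->used[index])
            return index;
    }
    return -1;  /* every slot is occupied */
}
```

With slots 0 and 2 free and a hash key of 3, the forward-only probe reports a full table while the two-pass search returns slot 0. The sketch covers insertion only; any lookup would have to follow the same probe order to find entries stored below their hash key.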

Regards

--
Bas van der Vlies
[EMAIL PROTECTED]



