>> Have you tried to recompile maui with larger limits?
>> 
>> sed -i -e "/MAX_MRES/ s/1024/8192/g" include/moab.h  
>> sed -i -e "/MMAX_JOB/ s/4096/8192/g" ./include/msched.h
>> 
>> There might be others that need to be increased too.
> 
> OK, I specifically mentioned that the level varies daily, so it is 
> probably not related to those limits. Sometimes it blocks at 3200, sometimes 
> at 4200 (example figures, not exact counts). This maui build is compiled by 
> EMI and is used on clusters with far more cores and running jobs, so it is 
> not a limit issue; it is a runtime state issue. 


After increasing the log level, the message that seems to be tied to the problem is:

01/10 17:19:31 ALERT:    corruption found on iteration 0 in location MJobGetSNRange-Start on node wn-v-3800.local

It then iterates again, chooses another node, and reports the same error. Looking 
at the ALERT output:

01/10 17:37:02 ALERT:    corruption found on iteration 1 in location MResAdjustDRes-Start on node wn-v-4520.local
01/10 17:37:02 ALERT:    R[023] 2097627 started but not ended
01/10 17:37:02 ALERT:    R[023] 2097627 has no associated events
01/10 17:37:02 ALERT:    corruption found on iteration 1 in location MResAdjustDRes-End on node wn-v-4520.local
01/10 17:37:02 ALERT:    R[023] 2097627 started but not ended
01/10 17:37:02 ALERT:    R[023] 2097627 has no associated events
01/10 17:37:02 ALERT:    corruption found on iteration 1 in location MResAdjustDRes-Start on node wn-v-4664.local
01/10 17:37:02 ALERT:    R[023] 2098811 started but not ended
01/10 17:37:02 ALERT:    R[023] 2098811 has no associated events
01/10 17:37:02 ALERT:    corruption found on iteration 1 in location MResAdjustDRes-End on node wn-v-4664.local
01/10 17:37:02 ALERT:    R[023] 2098811 started but not ended
01/10 17:37:02 ALERT:    R[023] 2098811 has no associated events
01/10 17:37:02 ALERT:    corruption found on iteration 1 in location MResAdjustDRes-Start on node wn-v-4772.local
01/10 17:37:02 ALERT:    R[023] 2097181 started but not ended
01/10 17:37:02 ALERT:    R[023] 2097181 has no associated events

The list of affected nodes varies a lot, so it is difficult to pin the problem on any specific node. 
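
Since the node list varies between runs, one way to look for a pattern is to tally the ALERT records per location, per node, and per identifier following R[...]. The Python below is only a rough sketch: the regular expressions are inferred from the excerpt above and nothing else, so other maui versions or log levels may need different patterns, and the names (unwrap, summarize, SAMPLE) are made up for illustration. It also tolerates records that were wrapped onto a second line.

```python
import re
from collections import Counter

# Patterns inferred only from the log excerpt in this mail (an assumption,
# not a documented maui log format).
TIMESTAMP = re.compile(r"^\d{2}/\d{2} \d{2}:\d{2}:\d{2} ")
CORRUPTION = re.compile(
    r"ALERT:\s+corruption found on iteration (\d+) in location\s+(\S+) on node (\S+)"
)
RESERVATION = re.compile(
    r"ALERT:\s+R\[(\d+)\] (\d+) (started but not ended|has no associated events)"
)


def unwrap(lines):
    """Rejoin wrapped records: a non-empty line without a leading
    timestamp is treated as a continuation of the previous record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if TIMESTAMP.match(line) or not records:
            records.append(line)
        else:
            records[-1] += " " + line
    return records


def summarize(lines):
    """Count corruption alerts per location and per node, and count
    R[...] alerts per identifier that follows the bracket."""
    locations, nodes, ids = Counter(), Counter(), Counter()
    for record in unwrap(lines):
        m = CORRUPTION.search(record)
        if m:
            locations[m.group(2)] += 1
            nodes[m.group(3)] += 1
            continue
        m = RESERVATION.search(record)
        if m:
            ids[m.group(2)] += 1
    return locations, nodes, ids


# A few records copied from the excerpt above, including wrapped ones.
SAMPLE = """\
01/10 17:37:02 ALERT:    corruption found on iteration 1 in location
MResAdjustDRes-Start on node wn-v-4520.local
01/10 17:37:02 ALERT:    R[023] 2097627 started but not ended
01/10 17:37:02 ALERT:    R[023] 2097627 has no associated events
01/10 17:37:02 ALERT:    corruption found on iteration 1 in location
MResAdjustDRes-End on node wn-v-4520.local
01/10 17:37:02 ALERT:    R[023] 2097627 started but not ended
01/10 17:37:02 ALERT:    R[023] 2097627 has no associated events
"""

locations, nodes, ids = summarize(SAMPLE.splitlines())
for title, counter in (("location", locations), ("node", nodes), ("id", ids)):
    for key, count in counter.most_common():
        print(f"{title:8} {key:28} {count}")
```

To run it over a full log, pass the open file object to summarize() instead of SAMPLE.splitlines(). If the same identifiers keep appearing while the nodes change, that would point at specific jobs or reservations rather than at the nodes themselves.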


Mario Kadastik, PhD
Researcher

---
  "Physics is like sex, sure it may have practical reasons, but that's not why 
we do it" 
     -- Richard P. Feynman
