.... I would be very grateful for any help. Maybe someone running slurm+maui could set "LOGLEVEL 7" in maui.cfg and check whether their logs look like mine. In particular, please check whether the size in bytes on the MSecGetChecksum line matches the size on the line just before it. In my logs they sometimes do not match: 4435 and 4363, for example.
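The comparison described above can be scripted instead of done by eye. This is a minimal sketch in Python, assuming the log lines have exactly the format quoted in this post; the function name `find_size_mismatches` and the two regular expressions are my own illustration, not part of maui or slurm.

```python
import re

# Patterns modelled on the log lines quoted in this post (an assumption:
# other maui versions may word these lines differently).
READ_RE = re.compile(r"INFO:\s+(\d+) of (\d+) bytes read from sd \d+")
CKSUM_RE = re.compile(r"MSecGetChecksum\(Buf,(\d+),")


def find_size_mismatches(lines):
    """Return (bytes_read, checksum_len) pairs where the sizes differ.

    Only a "bytes read" line that is directly followed by a
    MSecGetChecksum line is compared.
    """
    mismatches = []
    last_read = None
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        read_match = READ_RE.search(line)
        if read_match:
            last_read = int(read_match.group(1))
            continue
        cksum_match = CKSUM_RE.search(line)
        if cksum_match and last_read is not None:
            cksum_len = int(cksum_match.group(1))
            if cksum_len != last_read:
                mismatches.append((last_read, cksum_len))
        last_read = None  # any other line breaks the adjacency
    return mismatches


# Lines copied from the log in this post.
sample = """\
06/26 13:34:47 INFO: 9 of 9 bytes read from sd 7
06/26 13:34:47 MSURecvPacket(7,BufP,4435,NULL,9000000,SC)
06/26 13:34:47 MSUSelectRead(7,9000000)
06/26 13:34:47 INFO: 4435 of 4435 bytes read from sd 7
06/26 13:34:47 MSecGetChecksum(Buf,4363,Checksum,DES,CSKey)
06/26 13:34:47 INFO: 3704 of 3704 bytes read from sd 7
06/26 13:34:47 MSecGetChecksum(Buf,3632,Checksum,DES,CSKey)
""".splitlines()

for read_len, cksum_len in find_size_mismatches(sample):
    print(f"read {read_len} bytes, checksum over {cksum_len} "
          f"(gap {read_len - cksum_len})")
```

To run it on a real log, pass an open file object to `find_size_mismatches` in place of `sample`. On the lines from my log it reports both pairs, 4435/4363 and 3704/3632, each with a gap of 72 bytes.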
06/26 13:34:47 INFO: 4435 of 4435 bytes read from sd 7
06/26 13:34:47 MSecGetChecksum(Buf,4363,Checksum,DES,CSKey)

Correct logs:

06/26 13:34:47 ServerProcessRequests()
06/26 13:34:47 MLogRoll(NULL,0,1)
06/26 13:34:47 INFO: not rolling logs (441447 < 10000000)
06/26 13:34:47 MResAdjust(NULL,0,0)
06/26 13:34:47 MJobSetAttr(,PAL,Value,1,2)
06/26 13:34:47 INFO: job flags for job : 0, req napolicy=SHARED
06/26 13:34:47 MJobSetAttr(,GAttr,Value,1,5)
06/26 13:34:47 MStatInitializeActiveSysUsage()
06/26 13:34:47 MStatClearUsage([NONE],Active)
06/26 13:34:47 ServerUpdate()
06/26 13:34:47 MSysUpdateTime()
06/26 13:34:47 INFO: starting iteration 60
06/26 13:34:47 MSchedProcessJobs()
06/26 13:34:47 MRMGetInfo()
06/26 13:34:47 MClusterClearUsage()
06/26 13:34:47 MRMClusterQuery()
06/26 13:34:47 MWikiClusterLoadInfo(n00,RCount,EMsg,SC)
06/26 13:34:47 MWikiDoCommand(n00,7321,9000000,CHECKSUM,CMD=GETNODES ARG=0:ALL,Data,DataSize,SC)
06/26 13:34:47 MSUConnect(S,FALSE,EMsg)
06/26 13:34:47 INFO: trying to connect to 10.1.0.1 (Port: 7321)
06/26 13:34:47 INFO: non-blocking mode established
06/26 13:34:47 MSUSelectWrite(7,9000000)
06/26 13:34:47 INFO: successful connect to TCP server (sd: 7)
06/26 13:34:47 MSUSendData(S,9000000,TRUE,FALSE)
06/26 13:34:47 MSecGetChecksum2(Buf1,27,Buf2,22,Checksum,DES,CSKey)
06/26 13:34:47 INFO: header created '00000069 CK=2c5f6971a5844eef TS=1246008887 AUTH=root DT='
06/26 13:34:47 INFO: sending short packet '00000069 CK=2c5f6971a5844eef TS=1246008887 AUTH=root DT=CMD=GETNODES ARG=0:ALL'
06/26 13:34:47 MSUSendPacket(7,Buf,78,9000000,SC)
06/26 13:34:47 MSUSelectWrite(7,9000000)
06/26 13:34:47 INFO: packet sent (78 bytes of 78)
06/26 13:34:47 INFO: command sent to server
06/26 13:34:47 INFO: message sent: 'CMD=GETNODES ARG=0:ALL'
06/26 13:34:47 MSURecvData(S,9000000,TRUE,SC,EMsg)
06/26 13:34:47 MSURecvPacket(7,BufP,9,NULL,9000000,SC)
06/26 13:34:47 MSUSelectRead(7,9000000)
06/26 13:34:47 INFO: 9 of 9 bytes read from sd 7
06/26 13:34:47 MSURecvPacket(7,BufP,4435,NULL,9000000,SC)
06/26 13:34:47 MSUSelectRead(7,9000000)
06/26 13:34:47 INFO: 4435 of 4435 bytes read from sd 7
06/26 13:34:47 MSecGetChecksum(Buf,4363,Checksum,DES,CSKey)
06/26 13:34:47 ALERT: checksum does not match (351c7a893a2e1699:b4584308b241ec39) request 'TS=1246008887 AUTH=slurm DT=SC =0 ARG=64#n01:STATE=Running;ARCH=x86_64;OS=Linux;CMEMORY=10240;CDISK=0;CPROC=8;#n02:STATE='
06/26 13:34:47 ERROR: cannot receive data from server n00:7321
06/26 13:34:47 MSUDisconnect(S)
06/26 13:34:47 ALERT: cannot get node list from WIKI RM
06/26 13:34:47 ALERT: cannot load cluster resources on RM (RM 'n00' failed in function 'clusterquery')
06/26 13:34:47 WARNING: no resources detected
06/26 13:34:47 MRMWorkloadQuery()
06/26 13:34:47 MWikiWorkloadQuery(n00,JCount,SC)
06/26 13:34:47 MWikiDoCommand(n00,7321,9000000,CHECKSUM,CMD=GETJOBS ARG=0:ALL,Data,DataSize,SC)
06/26 13:34:47 MSUConnect(S,FALSE,EMsg)
06/26 13:34:47 INFO: trying to connect to 10.1.0.1 (Port: 7321)
06/26 13:34:47 INFO: non-blocking mode established
06/26 13:34:47 MSUSelectWrite(7,9000000)
06/26 13:34:47 INFO: successful connect to TCP server (sd: 7)
06/26 13:34:47 MSUSendData(S,9000000,TRUE,FALSE)
06/26 13:34:47 MSecGetChecksum2(Buf1,27,Buf2,21,Checksum,DES,CSKey)
06/26 13:34:47 INFO: header created '00000068 CK=4e880ad31a667b74 TS=1246008887 AUTH=root DT='
06/26 13:34:47 INFO: sending short packet '00000068 CK=4e880ad31a667b74 TS=1246008887 AUTH=root DT=CMD=GETJOBS ARG=0:ALL'
06/26 13:34:47 MSUSendPacket(7,Buf,77,9000000,SC)
06/26 13:34:47 MSUSelectWrite(7,9000000)
06/26 13:34:47 INFO: packet sent (77 bytes of 77)
06/26 13:34:47 INFO: command sent to server
06/26 13:34:47 INFO: message sent: 'CMD=GETJOBS ARG=0:ALL'
06/26 13:34:47 MSURecvData(S,9000000,TRUE,SC,EMsg)
06/26 13:34:47 MSURecvPacket(7,BufP,9,NULL,9000000,SC)
06/26 13:34:47 MSUSelectRead(7,9000000)
06/26 13:34:47 INFO: 3704 of 3704 bytes read from sd 7
06/26 13:34:47 MSecGetChecksum(Buf,3632,Checksum,DES,CSKey)
06/26 13:34:47 ALERT: checksum does not match (e3743199c5566b9a:9ab1d151dd49049c) request 'TS=1246008887 AUTH=slurm DT=SC =0 ARG=17#191814:STATE=Running;TASKLIST=:n01;UPDATETIME=1246007985;WCLIMIT=31536000;TASKS='
06/26 13:34:47 ERROR: cannot receive data from server n00:7321
06/26 13:34:47 MSUDisconnect(S)
06/26 13:34:47 ALERT: cannot get job list from WIKI RM
06/26 13:34:47 ALERT: cannot load cluster workload on RM (RM 'n00' failed in function 'workloadquery')
06/26 13:34:47 WARNING: no workload detected

_______________________________________________
mauiusers mailing list
[email protected]
http://www.supercluster.org/mailman/listinfo/mauiusers
