Yep, this is a common problem. I've never bothered to figure out why the memory can't be allocated; as you note, the failure is in sgp_dd itself, not in the iokit scripts. It could be a resource limit of some sort (pinned pages?). If you have time to dig into it, I'm sure many people would appreciate it.
One thing to note is that Lustre limits itself to 512 total threads per server, so there are never more than that many outstanding IOs when running Lustre, although additional client requests can be queued and processed, which is why higher crg/thread values are interesting. If you limit the sgpdd_survey total thread count, you should not see these failures (note that 1536 threads has one failing write process while 3072 has 140; perhaps you could have sgp_dd retry the allocation).

Kevin

Heald, Nathan T. wrote:
> Hi everyone,
> I have been running sgpdd-survey on some DDN 9550's and am getting some
> errors. I'm using what I believe to be the latest version of the I/O Kit
> (lustre-iokit-1.2-200709210921). I've got 4 OSSes attached and run
> sgpdd-survey against all the disks from each host one at a time. Each host is
> getting these errors, but not identically. I've found several threads on the
> mailing list with people reporting this same error but there are no
> resolutions posted. One post suggested a modification to the flags for
> "sg_readcap" in the script could resolve these errors, but making the
> changes did not seem to fix the issue. It looks like sgp_dd is having
> intermittent problems:
>
> 16384+0 records out
> sg starting in command at "sgp_dd.c":827: Cannot allocate memory
> sg starting in command at "sgp_dd.c":827: Cannot allocate memory
> sg starting in command at "sgp_dd.c":827: Cannot allocate memory
> sg starting in command at "sgp_dd.c":827: Cannot allocate memory
> sg starting in command at "sgp_dd.c":827: Cannot allocate memory
> sg starting in command at "sgp_dd.c":827: Cannot allocate memory
>
> Output from sgpdd-survey:
>
> Wed Dec 1 10:55:55 EST 2010 sgpdd-survey on /dev/sdp /dev/sdo /dev/sdn
> /dev/sdw /dev/sdv /dev/sdu /dev/sdt /dev/sds /dev/sdy /dev/sdr /dev/sdx
> /dev/sdq from oss1
> ...
> total_size 100663296K rsz 1024 crg 384 thr 768 write 388.20 MB/s 384 x 1.01 = 388.18 MB/s read 387.16 MB/s 384 x 1.01 = 388.18 MB/s
> total_size 100663296K rsz 1024 crg 384 thr 1536 write 1 failed read 385.72 MB/s 384 x 1.01 = 388.18 MB/s
> total_size 100663296K rsz 1024 crg 384 thr 3072 write 140 failed read 121 failed
> total_size 100663296K rsz 1024 crg 384 thr 6144 ENOMEM
> total_size 100663296K rsz 1024 crg 768 thr 768 write 1 failed read 387.28 MB/s 768 x 0.51 = 388.18 MB/s
> total_size 100663296K rsz 1024 crg 768 thr 1536 write 388.23 MB/s 768 x 0.51 = 388.18 MB/s read 386.76 MB/s 768 x 0.51 = 388.18 MB/s
> total_size 100663296K rsz 1024 crg 768 thr 3072 write 42 failed read 31 failed
> total_size 100663296K rsz 1024 crg 768 thr 6144 ENOMEM
> total_size 100663296K rsz 1024 crg 768 thr 12288 ENOMEM
> ...
>
> Any suggestions are welcome.
>
> Thanks,
> -Nathan
>
> _______________________________________________
> Lustre-discuss mailing list
> [email protected]
> http://lists.lustre.org/mailman/listinfo/lustre-discuss
