Heald, Nathan T. wrote:
> Hi everyone,
> I have been running sgpdd-survey on some DDN 9550's and am getting some
> errors. I'm using what I believe to be the latest version of the I/O Kit
> (lustre-iokit-1.2-200709210921). I've got 4 OSSes attached and run
> sgpdd-survey against all the disk from each host one at a time. Each host is
> getting these errors, but not identically. I've found several threads on the
> mailing list with people reporting this same error but there are no
> resolutions posted. One post suggested a modification to the flags for
> "sg_readcap" in the script could resolve these errors, but making the
> changes did not seem to fix the issue. It looks like sgp_dd is having
> intermittent problems:
>
> 16384+0 records out
> sg starting in command at "sgp_dd.c":827: Cannot allocate memory
[snip]
>
> Output from sgpdd-survey:
>
> Wed Dec 1 10:55:55 EST 2010 sgpdd-survey on /dev/sdp /dev/sdo /dev/sdn
> /dev/sdw /dev/sdv /dev/sdu /dev/sdt /dev/sds /dev/sdy /dev/sdr /dev/sdx
> /dev/sdq from oss1
> ...
> total_size 100663296K rsz 1024 crg 384 thr 768 write 388.20 MB/s 384
> x 1.01 = 388.18 MB/s read 387.16 MB/s 384 x 1.01 = 388.18 MB/s
> total_size 100663296K rsz 1024 crg 384 thr 1536 write 1 failed read
> 385.72 MB/s 384 x 1.01 = 388.18 MB/s
> total_size 100663296K rsz 1024 crg 384 thr 3072 write 140 failed read 121
> failed
> total_size 100663296K rsz 1024 crg 384 thr 6144 ENOMEM

You just don't have enough RAM to do these particular runs. Look at
the line ending in ENOMEM above: sgpdd-survey is proposing to launch
384 separate sgp_dd processes for each of 12 different devices, with
each process launching 16 threads (6144 / 384), and each thread
allocating at least one 1 MiB write buffer. That is 384 x 12 x 16 =
73,728 threads, which adds up to 72 GiB of RAM for write buffers.

The ENOMEM line means that the sgpdd-survey script looked at the
amount of physical RAM you have and estimated that it wasn't enough
to do this run.

You could try running sgpdd-survey against each block device one at
a time, which will reduce the needed RAM by a factor of 12 (in your
case), but of course that isn't quite equivalent to testing all the
devices at once.

sg_readcap is used to determine the physical sector size and
capacity (sector count) of each block device. I wouldn't expect
changing its flags to help here.

Jim Shankland
Whamcloud, Inc.
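
For anyone who wants to redo the arithmetic for their own crg/thr values, here is a minimal sketch of the estimate described above. It follows this reply's reading of the output (384 processes per device, 16 threads per process, one 1 MiB buffer per thread). The function name and its parameters are made up for illustration and are not part of sgpdd-survey.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def estimate_buffer_bytes(procs_per_device, devices, threads_per_proc,
                          buffer_bytes=MIB):
    """Rough lower bound on RAM needed for I/O buffers in one run.

    Assumes every thread holds exactly one buffer of buffer_bytes,
    as described in the reply above (hypothetical helper).
    """
    total_threads = procs_per_device * devices * threads_per_proc
    return total_threads * buffer_bytes


# Values taken from the ENOMEM line: crg 384, thr 6144, 12 devices.
threads_per_proc = 6144 // 384  # 16 threads per sgp_dd process

all_devices = estimate_buffer_bytes(384, 12, threads_per_proc)
one_device = estimate_buffer_bytes(384, 1, threads_per_proc)

print(f"all 12 devices: {all_devices / GIB:.0f} GiB")
print(f"one device:     {one_device / GIB:.0f} GiB")
```

With the numbers from the ENOMEM line this prints 72 GiB for all 12 devices and 6 GiB for a single device, which is the factor-of-12 reduction mentioned above.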
