On 14/12/10 20:27, Kevin Van Maren wrote:
> Jim Shankland wrote:
>>> ...
>>> total_size 100663296K rsz 1024 crg 384 thr  768 write 388.20 MB/s 384 x 1.01 = 388.18 MB/s read 387.16 MB/s 384 x 1.01 = 388.18 MB/s
>>> total_size 100663296K rsz 1024 crg 384 thr 1536 write 1 failed read 385.72 MB/s 384 x 1.01 = 388.18 MB/s
>>> total_size 100663296K rsz 1024 crg 384 thr 3072 write 140 failed read 121 failed
>>> total_size 100663296K rsz 1024 crg 384 thr 6144 ENOMEM
>>>
>>
>> You just don't have enough RAM to do these particular runs.
>> If you look at the line ending in ENOMEM above: sgpdd-survey
>> is proposing to launch 384 separate sgp_dd processes for each
>> of 12 different devices, with each process launching 16
>> threads (6144 / 384), and each thread allocating at least one
>> 1 MiB write buffer. That adds up to 72 GiB of RAM for write
>> buffers. The ENOMEM line means that the sgpdd-survey script
>> looked at the amount of physical RAM you have and estimated it
>> wasn't enough to do this run.
>>
>
> It's not just the ENOMEM at 6144 total threads that is the problem; it
> is the "write X failed", etc., at the _lower_ thread counts.
>
> From memory, the "crg" and "thr" numbers are already multiplied by 12
> (the number of devices being tested), so "thr" should reflect the total
> number of buffers required. For this test, it looks like crg=32 and
> SG_MAX_QUEUE is the default 16. So the memory consumption _should not_
> be an issue, but sgp_dd is still having problems allocating buffers.
>
> Again, I've seen this even when I clearly had free memory on the node,
> so I think there is something else at work here.
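For what it's worth, the arithmetic behind the 72 GiB estimate can be checked directly. This is a minimal sketch assuming the first reading of the numbers above (thr=6144 threads per device, 12 devices, one 1 MiB buffer per thread); under the alternative reading, where thr is already the total across all devices, the requirement would be 12x smaller:

```shell
# Estimate write-buffer RAM for the ENOMEM run,
# assuming thr=6144 is per-device and each thread holds one 1 MiB buffer.
thr=6144
devices=12
buf_mib=1

total_mib=$((thr * devices * buf_mib))
total_gib=$((total_mib / 1024))

echo "write buffers: ${total_mib} MiB = ${total_gib} GiB"
```

This reproduces the 72 GiB figure; if thr were already the grand total, the same run would only need 6 GiB, which is consistent with the suspicion that plain memory exhaustion doesn't explain the failures at lower thread counts.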
I've run into this problem (on a Scientific Linux 5.5 machine). If I use
/dev/sg1, I get the following:

[root@sn86 lustre]# sgp_dd if=/dev/zero of=/dev/sg1 seek=1024 thr=1 count=1677721 bs=512 bpt=2048 time=1
sg starting out command at "sgp_dd.c":872: Cannot allocate memory

whereas if I use /dev/sdb, I get:

[root@sn86 lustre]# sgp_dd if=/dev/zero of=/dev/sdb seek=1024 thr=1 count=1677721 bs=512 bpt=2048 time=1
time to transfer data was 0.485030 secs, 1771.01 MB/sec

They correspond to the same disk:

[root@sn86 lustre]# sg_map | grep sdb
/dev/sg1  /dev/sdb

Have I just defeated the point of using sgp_dd? Is the fact that this is
really a SATA disk (behind a Dell H700 controller) the problem?

Chris

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
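One thing worth checking in a case like Chris's: with bs=512 and bpt=2048, each sgp_dd transfer needs a buffer of bs * bpt = 1 MiB, while the sg driver's default reserved buffer per device is much smaller (commonly 32 KiB, set by the sg module's def_reserved_size parameter). The /dev/sdb path goes through the block layer instead, which may be why it succeeds. A quick sketch of the arithmetic (the sysfs path below is an assumption and may differ by kernel):

```shell
# Per-transfer buffer that sgp_dd must allocate for this command line.
bs=512
bpt=2048
buf_bytes=$((bs * bpt))

echo "per-transfer buffer: $((buf_bytes / 1024)) KiB"

# Compare against the sg driver's default reserved buffer; on many
# kernels this can be inspected with (path assumed, may vary):
#   cat /sys/module/sg/parameters/def_reserved_size
```

If the per-transfer buffer exceeds what the sg driver will hand out, "Cannot allocate memory" from the /dev/sg1 path would be expected even with plenty of free RAM, which fits the earlier observation that memory pressure alone doesn't explain the failures.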
