Jim Shankland wrote:
>> ...
>> total_size 100663296K rsz 1024 crg 384 thr  768 write 388.20 MB/s 384 x 1.01 = 388.18 MB/s read 387.16 MB/s 384 x 1.01 = 388.18 MB/s
>> total_size 100663296K rsz 1024 crg 384 thr 1536 write 1 failed read 385.72 MB/s 384 x 1.01 = 388.18 MB/s
>> total_size 100663296K rsz 1024 crg 384 thr 3072 write 140 failed read 121 failed
>> total_size 100663296K rsz 1024 crg 384 thr 6144 ENOMEM
>
> You just don't have enough RAM to do these particular runs.
> If you look at the line ending in ENOMEM above: sgpdd-survey
> is proposing to launch 384 separate sgp_dd processes for each
> of 12 different devices, with each process launching 16
> threads (6144 / 384), and each thread allocating at least
> one 1 MiB write buffer. That adds up to 72 GiB of RAM for write
> buffers. The ENOMEM line means that the sgpdd-survey script
> looked at the amount of physical RAM you have, estimated it
> wasn't enough to do this run, and skipped it.
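Jim's arithmetic above can be checked with a quick back-of-the-envelope calculation. This is only an illustrative sketch of the estimate he describes (one rsz-sized buffer per thread, per device), not sgpdd-survey's actual internal logic:

```python
MIB = 1024 * 1024

def write_buffer_bytes(threads_per_device, devices, rsz_kib=1024):
    """Estimate RAM needed for write buffers: one rsz-sized
    buffer per thread, for every device under test."""
    return threads_per_device * devices * rsz_kib * 1024

# The ENOMEM row: 384 sgp_dd processes x 16 threads = 6144 threads
# per device, across 12 devices, 1 MiB buffers each.
gib = write_buffer_bytes(6144, 12) / (1024 ** 3)
print(f"{gib:.0f} GiB")  # 72 GiB, matching Jim's figure
```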
It's not just the ENOMEM at 6144 total threads that is the problem; it is the "write X failed" errors at the _lower_ thread counts. From memory, the "crg" and "thr" numbers are already multiplied by 12 (the number of devices being tested), so "thr" should already reflect the total number of buffers required. For this test, it looks like crg=32 per device, and SG_MAX_QUEUE is the default 16. So memory consumption _should not_ be an issue, yet sgp_dd is still having problems allocating buffers. Again, I've seen this even when I clearly had free memory on the node, so I think something else is at work here.

Kevin

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
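Kevin's counter-reading can also be sketched numerically. The assumption here (Kevin's, stated from memory) is that "crg" and "thr" in the survey output are totals across all 12 devices rather than per-device figures; under that reading the total buffer footprint is far smaller than Jim's 72 GiB:

```python
DEVICES = 12
SG_MAX_QUEUE = 16  # the default Kevin mentions

def per_device_view(crg_total, thr_total, rsz_kib=1024):
    """Break down survey totals assuming crg/thr already include
    the x12 device multiplier (Kevin's reading, an assumption)."""
    crg_per_device = crg_total // DEVICES
    threads_per_crg = thr_total // crg_total
    total_buf_gib = thr_total * rsz_kib / (1024 * 1024)
    return crg_per_device, threads_per_crg, total_buf_gib

# The ENOMEM row under this reading: crg=384 total, thr=6144 total.
crg, tpc, gib = per_device_view(384, 6144)
print(crg, tpc, gib)  # 32 crgs/device, 16 threads/crg, 6.0 GiB total
```

Under this reading the run needs only 6 GiB of buffers and 16 threads per crg (within SG_MAX_QUEUE), which is why the allocation failures look like something other than simple memory exhaustion.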
