Has anyone run obdfilter-survey with large object/thread counts and can share their experiences? I'm having some odd issues and am trying to determine if it is a setup issue on my side, or something I should file a bug on with Sun.
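For reference, I'm driving this with the stock obdfilter-survey script from the Lustre iokit. A minimal sketch of the sort of invocation, using that script's variable names (the OST names here are illustrative, the exact obj/thr ranges vary by run, and size is in MB -- 8192 for the ~8 GB per OST I mention below):

    size=8192 nobjhi=512 thrhi=512 \
        targets="lusintjo-OST0000 lusintjo-OST0004 lusintjo-OST0010" \
        sh obdfilter-survey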
For this test, I've got a Dell 1950 (dual socket, quad core 2.3 GHz Xeon with 6 MB of cache), 16 GB of memory, and a DDR IB connection to my storage. There are 7 LUNs on the storage to be driven by this OSS, though the problem shows up with fewer OSTs under test. The Lustre version is 1.6.5 + patches (echo_client fix, sd_iostats fix).

I'm trying to write out 8 GB per OST to amortize startup costs and minimize cache effects. Things are fine until I hit high thread and object counts:

    ost 3 sz 50331648K rsz 1024 obj  768 thr 1536 write 1792.53 [ XXX, XXX] read 1788.48 [ XXX, XXX]
    ost 3 sz 50331648K rsz 1024 obj 1536 thr 1536 write 2972.45 SHORT read 5376.91 SHORT

As much as I love those numbers, I am quite certain that I'm not pushing ~1800 MB/s through a single DDR InfiniBand port, let alone ~3000-5300 MB/s.

The details file for the 3 OST, 512 object/OST, 1 thread/object run shows no status reports from the run. The 256 object/OST, 2 thread/object run did have some status lines:

    =============> write widow-oss1b1:lusintjo-OST0008_ecc
    Print status every 1 seconds
    --threads: starting 512 threads on device 9 running test_brw 32 wx q 256 1t5598
    =============> write widow-oss1b1:lusintjo-OST0008_ecc
    Print status every 1 seconds
    --threads: starting 512 threads on device 10 running test_brw 32 wx q 256 1t3561
    =============> write widow-oss1b1:lusintjo-OST0008_ecc
    Print status every 1 seconds
    --threads: starting 512 threads on device 11 running test_brw 32 wx q 256 1t1525
    =============> write global

Nothing showed up in dmesg during those runs. After a failed run with 7 OSTs on the OSS, at 512 objs/512 threads:

    ost 7 sz 117440512K rsz 1024 obj 3584 thr 3584 write 3606.52 SHORT read 12968.83 SHORT

I found the following in dmesg:

    Lustre: 24000:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) lusintjo-OST0010: slow journal start 30s
    Lustre: 24000:0:(filter_io_26.c:717:filter_commitrw_write()) lusintjo-OST0010: slow brw_start 30s
    Lustre: 24178:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) lusintjo-OST0000: slow journal start 30s
    Lustre: 24178:0:(filter_io_26.c:717:filter_commitrw_write()) lusintjo-OST0000: slow brw_start 30s
    Lustre: 22247:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) lusintjo-OST0004: slow journal start 30s
    Lustre: 22247:0:(lustre_fsfilt.h:246:fsfilt_brw_start_log()) Skipped 3 previous similar messages
    Lustre: 22247:0:(filter_io_26.c:717:filter_commitrw_write()) lusintjo-OST0004: slow brw_start 30s
    Lustre: 22247:0:(filter_io_26.c:717:filter_commitrw_write()) Skipped 3 previous similar messages

The Lustre tunables have not been changed from their defaults, so that could be contributing to this. Does anyone have any experience that could shed more light on this? What other information can I provide that would be helpful?

--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
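P.S. In case it helps, here is roughly how I've been poking at the OSTs after a run -- a sketch only; the proc path follows the 1.6 layout, and /dev/sdb is a placeholder for one of my actual LUN devices:

    # per-OST I/O size histogram from the obdfilter layer
    cat /proc/fs/lustre/obdfilter/lusintjo-OST0000/brw_stats

    # journal inode size on the backing ldiskfs device (inode 8 is the journal)
    debugfs -R 'stat <8>' /dev/sdb | grep -i size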
