Vladislav Bolkhovitin wrote:
Cameron Harr wrote:
Cameron Harr wrote:
This is still too high. Considering that each CS takes about 1 microsecond, you can estimate how many IOPS it costs you.
Dropping scst_threads from 8 down to 2, with 2 initiators, seems to make a fairly significant difference, propelling me to a little over 100K IOPS and putting the CS:IO ratio around 2:1, sometimes lower. Two threads gave the best performance compared to 1, 4, and 8.
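To put Vlad's ~1 microsecond figure into perspective, here's the back-of-the-envelope math I'm using (the per-switch cost is his estimate, not something I've measured on these boxes, and the IOPS/ratio are just my observed numbers):

# Rough cost of context switching at a given CS:IO ratio.
CS_COST_US = 1.0      # assumed microseconds per context switch (Vlad's figure)
IOPS = 100000         # observed IOPS
CS_PER_IO = 2.0       # observed context switches per I/O

cs_per_sec = IOPS * CS_PER_IO
cpu_sec_per_sec = cs_per_sec * CS_COST_US / 1e6
print("%d CS/s -> ~%.0f%% of one core spent just switching"
      % (cs_per_sec, cpu_sec_per_sec * 100))
# 200,000 CS/s * 1 us = 0.2 s of CPU per second, i.e. roughly 20% of one core.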

Just as a status update, I've gotten my best performance with scst_threads=3 on 2 initiators, and with a separate QP for each drive an initiator writes to. I'm getting a pretty consistent 112-115K IOPS with two initiators, each writing with 2 processes to the same 2 physical targets, using 512B blocks. Adding the second initiator only bumps me up by about 20K IOPS, but as all the CPUs are pegged around 99%, I'll take that as the bottleneck. Also, as a follow-up on Vlad's advice, the CS rate is now around 70K/s at 115K IOPS, so it's not too bad. Interrupts (where this thread started) are around 200K/s - a lot higher than I thought they'd go, but I'm not complaining. :)
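For anyone who wants to watch the same numbers, the CS and interrupt rates can be had by diffing the cumulative ctxt and intr counters in /proc/stat, which is essentially what vmstat reports. A quick sketch:

# Sample /proc/stat twice, one second apart, and diff the cumulative
# "ctxt" (context switches) and "intr" (interrupts) counters to get
# per-second rates.
import time

def read_counters():
    counters = {}
    for line in open("/proc/stat"):
        fields = line.split()
        if fields[0] in ("ctxt", "intr"):
            counters[fields[0]] = int(fields[1])
    return counters

before = read_counters()
time.sleep(1)
after = read_counters()
print("CS/s:  ", after["ctxt"] - before["ctxt"])
print("intr/s:", after["intr"] - before["intr"])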

Actually, what you did is tune your workload so that it maps nicely onto all the participating threads and CPU cores: each thread stays on its own CPU core and gracefully passes commands to the others during processing, with everything busy almost all the time. I.e., you put your system into some kind of resonance. If you change your workload just a bit, or the Linux scheduler changes in the next kernel version, your tuning would be destroyed.

This "resonance" thought actually crossed my mind. I later ran the test locally on the target and found that I got better performance via SRP than I did locally (good marketing for you :) ). The local run, with no networking involved, gave me around 2 CS/IO. It also appeared that when I added the second initiator, the requests from the two initiators to a single target were getting coalesced, which improved performance.
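If the gain really does come from which core each thread lands on, one way to make that less dependent on the scheduler's mood would be to pin the threads explicitly. Just a sketch - I haven't actually tried this yet, and the PIDs below are placeholders for whatever the scst threads and I/O processes really are:

# Pin each worker to its own core so the placement doesn't depend on what
# the scheduler of the day decides. The PIDs are placeholders; substitute
# the real scst thread / initiator process IDs.
import os

pinning = {
    1234: {0},   # hypothetical PID of one scst thread -> core 0
    1235: {1},   # hypothetical PID of another scst thread -> core 1
}

for pid, cores in pinning.items():
    os.sched_setaffinity(pid, cores)          # same effect as taskset -cp
    print(pid, "->", sorted(os.sched_getaffinity(pid)))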
So, I wouldn't overestimate your results. As I already wrote, the only real fix is to remove all the unneeded context switches between threads during command processing. That fix would work not only on carefully tuned artificial workloads, but on real-life ones too. Having 5-10 threads participate in processing a single command reminds me of the famous family of jokes about how many people of some kind it takes to change a burnt-out light bulb ;)
Nice analogy :). I wish I knew how to eradicate the extra context switches. I'll try Bart's trick and see if I can get more info.
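I don't know yet exactly what Bart had in mind, but in the meantime one way to see which threads are doing the switching is to watch the voluntary/nonvoluntary counters in /proc/<pid>/status. A rough sketch (the thread-name filter is just a guess at what the SCST/SRPT threads are called on a given kernel):

# Dump cumulative per-process context switch counters from /proc/<pid>/status.
# The counters only ever grow, so sample twice and diff to get rates.
import glob

NAME_FILTER = ("scst", "srpt")    # guess at thread name substrings; adjust

for path in glob.glob("/proc/[0-9]*/status"):
    try:
        fields = dict(line.split(":\t", 1) for line in open(path) if ":\t" in line)
    except IOError:               # the process may have exited in the meantime
        continue
    name = fields.get("Name", "").strip()
    if any(s in name for s in NAME_FILTER):
        print(name, path,
              "voluntary:", fields.get("voluntary_ctxt_switches", "?").strip(),
              "nonvoluntary:", fields.get("nonvoluntary_ctxt_switches", "?").strip())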