Vladislav Bolkhovitin wrote:
> Cameron Harr wrote:
>> Cameron Harr wrote:
>>>> This is still too high. Considering that each CS is about 1
>>>> microsecond, you can estimate how many IOPS it costs you.
>>> Dropping scst_threads down to 2, from 8, with 2 initiators seems to
>>> make a fairly significant difference, propelling me to a little over
>>> 100K IOPS and putting the CS rate around 2:1, sometimes lower. Two
>>> threads gave the best performance compared to 1, 4 and 8.
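(For anyone else trying this: scst_threads appears to be a plain load-time
parameter of the scst module, so assuming a modprobe-based setup the change
is just

    modprobe -r scst              # unload first (target drivers such as ib_srpt have to come out before scst)
    modprobe scst scst_threads=2  # reload with the new thread count

and the active value should show up under /sys/module/scst/parameters/ if
the kernel exposes module parameters there.)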
>> Just as a status update, I've gotten my best performance with
>> scst_threads=3 on 2 initiators, using a separate QP for each drive an
>> initiator is writing to. I'm getting a pretty consistent 112-115K IOPS
>> using two initiators, each writing with 2 processes to the same 2
>> physical targets, using 512B blocks. Adding the second initiator only
>> bumps me up by about 20K IOPS, but as all the CPUs are pegged around
>> 99%, I'll take that as a bottleneck. Also, following up on Vlad's
>> advice, the CS rate is now around 70K/s at 115K IOPS, so it's not too
>> bad. Interrupts (where this thread started) are around 200K/s - a lot
>> higher than I thought they'd go, but I'm not complaining. :)
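(For the record, the rates above come from the stock counters; assuming
procps-style tools are installed, something like this is all it takes to
watch them:

    vmstat 1              # 'cs' column = context switches/s, 'in' = interrupts/s
    cat /proc/interrupts  # per-IRQ totals, to see which lines the HCA is actually hitting

And as a back-of-envelope check of the ~1 microsecond/CS figure quoted
above: 70K CS/s x ~1 microsecond is roughly 0.07 s of CPU per second,
i.e. about 7% of one core going to switching, or about 0.6 switches per
I/O at 115K IOPS.)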
> Actually, what you did is tune your workload so that it maps nicely onto
> all the participating threads and CPU cores: each thread stays on its own
> core and they gracefully hand commands to each other during processing,
> staying busy almost all the time. I.e., you put your system into some
> kind of resonance. If you change your workload just a bit, or the Linux
> scheduler changes in the next kernel version, your tuning would be
> destroyed.
This "resonance" thought actually crossed my mind. I later went and ran
the test locally and found that I got better performance via SRP than I
did locally (good marketing for you :) ). The local run, using no
networking, gave me around 2 CS/IO. It appeared that when I added the
second initiator, the requests from the 2 initiators for a single target
would get coalesced, which would improve the performance.
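(Two stock tools help sanity-check both the "resonance" theory and the
per-IO switch counts; nothing SCST-specific here, just assuming procps and
sysstat are available:

    ps -eLo pid,tid,psr,comm | grep -i scst   # 'psr' = CPU each thread last ran on; adjust the name pattern as needed
    pidstat -w -t 1                           # per-thread voluntary/involuntary context switches per second

Watching psr over a few samples shows whether the threads really are each
parking on one core, and pidstat makes it easy to divide a given thread's
switch rate by its IOPS.)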
> So, I wouldn't overestimate your results. As I already wrote, the only
> real fix is to remove all the unneeded context switches between threads
> during command processing. That fix would work not only on carefully
> tuned artificial workloads, but on real-life ones too. Having 5-10
> threads participate in processing a single command reminds me of the
> famous set of jokes about how many people of some kind it takes to
> change a burnt-out light bulb ;)
Nice analogy :). I wish I knew how to eradicate the extra context
switches. I'll try Bart's trick and see if I can get more info: