Bart Van Assche, on 03/01/2010 11:38 PM wrote:
On Mon, Mar 1, 2010 at 9:12 PM, Vladislav Bolkhovitin <[email protected]> wrote:
[ ... ]
It's good if my impression was wrong, but you've got suspiciously
low IOPS numbers. On your hardware you should see much more. It seems
you hit a bottleneck on the initiator somewhere above the driver
level (fio? the sg engine? IRQ or context switch counts?), so
your results may not really be related to the topic. Oprofile and
lockstat output could shed more light on this.
The number of IOPS I obtained is really high considering that I used the
sg I/O engine. This means that no buffering has been used and none of
the I/O requests were combined into larger requests. I chose the sg I/O
engine on purpose in order to bypass the block layer. I was not
interested in record IOPS numbers but in a test where most of the time
is spent in the SRP / iSER initiator instead of the block layer.
116K IOPS isn't high; it's pretty low for QDR IB. Even 4 Gb/s FC can
outperform it. Remember, Microsoft managed to get 1 million IOPS
from 10 GbE, and your card should be much faster. This is why I have a
strong suspicion that the test is incorrect.
Let's estimate how much your IB card can achieve. It has 1 us latency on
1-byte packets, so it can perform at least 1 million ops/sec. This is
an upper-bound estimate, because (1) if the card has a multi-core
setup, this number can be several times bigger, and (2) it includes data
transfers. On the other hand, you can read data through your card at 2.9 GB/s.
If we assume that transferring a 512-byte packet has 100% overhead (an
upper-bound estimate too, because I can't believe that such a low-latency
HPC interconnect has such a huge data transfer overhead), the card can
transfer 2.9 GB/s / (512 B * 2) ≈ 2.8 million IOPS. So
your IB hardware should be capable of at least 1 million I/O
transfers per second, ten times more than you measured.
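The bandwidth-side bound can be sketched as a quick shell calculation; the 2.9 GB/s figure and the 100% per-packet overhead are the assumptions stated above:

```shell
# Upper-bound IOPS estimate from link bandwidth (assumed figures from the text):
BANDWIDTH_BPS=2900000000      # 2.9 GB/s read bandwidth
PAYLOAD=512                   # bytes of data per I/O
OVERHEAD_FACTOR=2             # 100% transfer overhead -> each I/O costs 2x payload

# Integer divide: bytes/sec over bytes-per-I/O gives I/Os per second.
echo $(( BANDWIDTH_BPS / (PAYLOAD * OVERHEAD_FACTOR) ))   # ~2.8 million IOPS
```

Even with this deliberately pessimistic overhead factor, the result is well above the latency-side bound of 1 million ops/sec, so the latency bound is the binding one.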
So, you definitely need to find the bottleneck. I would start by
checking:
1. Whether fio itself is implemented inefficiently. This can be checked
using the null ioengine.
2. Whether you have only one outstanding command at a time (queue depth 1).
You can check this during the test either with iostat on the initiator, or
(better) on the SCST target in the /proc/scsi_tgt/sessions and
/proc/scsi_tgt/sgv files.
3. Whether the sg engine is used by fio in indirect mode, i.e. transferring
data between user and kernel space with a data copy. This can be checked
by looking at fio's sources or with oprofile.
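A rough sketch of the three checks as commands (the fio job parameters here are illustrative, not the ones used in the original test; the /proc paths assume an SCST target as described above):

```shell
# 1. Measure fio's own per-I/O overhead: the null ioengine completes
#    requests immediately without issuing any real I/O.
fio --name=nullcheck --ioengine=null --rw=randread --bs=512 \
    --size=1g --iodepth=64

# 2. Watch the effective queue depth on the initiator while the real
#    test runs (the avgqu-sz column shows outstanding commands per device).
iostat -x 1

# 3. On the SCST target, inspect session and buffer state instead.
cat /proc/scsi_tgt/sessions /proc/scsi_tgt/sgv
```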
Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html