Bart Van Assche, on 03/01/2010 11:38 PM wrote:
On Mon, Mar 1, 2010 at 9:12 PM, Vladislav Bolkhovitin <[email protected]> wrote:

    [ ... ]
    It's good if my impression was wrong. But you've got suspiciously
    low IOPS numbers. On your hardware you should have much more. Seems
    you experienced a bottleneck on the initiator somewhere above the
    drivers level (fio? sg engine? IRQs or context switches count?), so
    your results could be not really related to the topic. Oprofile and
    lockstat output can shed more light on this.


The number of IOPS I obtained is really high considering that I used the sg I/O engine. This means that no buffering has been used and none of the I/O requests were combined into larger requests. I chose the sg I/O engine on purpose in order to bypass the block layer. I was not interested in record IOPS numbers but in a test where most of the time is spent in the SRP / iSER initiator instead of the block layer.

116K IOPS isn't high; it's pretty low for QDR IB. Even 4 Gb/s FC can outperform it. Remember, Microsoft has managed to get 1 million IOPS out of 10 GbE, and your card should be much faster. This is why I have a strong suspicion that the test is incorrect.

Let's estimate how much your IB card can achieve. It has 1 us latency on 1-byte packets, so it can perform at least 1 million ops/sec. This is an upper-bound estimate, because (1) if the card has a multi-core setup, this number can be several times bigger, and (2) it includes data transfers. On the other hand, you can read data via your card at 2.9 GB/s. If we assume that transferring a 512-byte packet has 100% overhead (an upper-bound estimate too, because I can't believe that such a low-latency HPC interconnect has such a huge data transfer overhead), the card can transfer 2.9e9 / (512 * 2) ≈ 2.8 million IOPS. So your IB hardware should be capable of at least 1 million I/O transfers per second, which is 10 times more than you measured.
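The back-of-envelope bound above can be reproduced in a few lines of shell. The 1 us latency, 2.9 GB/s bandwidth, and 100% per-packet overhead figures are the assumptions stated above, not measured values:

```shell
# Upper-bound IOPS estimates from the assumed hardware figures.
LAT_US=1                      # ~1 us per 1-byte packet (assumed)
BW_BYTES=2900000000           # 2.9 GB/s read bandwidth (assumed)
PKT=512                       # I/O size in bytes
OVERHEAD=2                    # 100% per-packet overhead -> 2x bytes per transfer

latency_bound=$((1000000 / LAT_US))               # ops/s from latency alone
bandwidth_bound=$((BW_BYTES / (PKT * OVERHEAD)))  # ops/s from bandwidth alone
echo "latency bound:   ${latency_bound} IOPS"
echo "bandwidth bound: ${bandwidth_bound} IOPS"
```

Both bounds come out around or above a million ops/sec, an order of magnitude above the 116K measured, which is the gap that needs explaining.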

So, you definitely need to find the bottleneck. I would start by checking:

1. fio might be implemented inefficiently. This can be checked using the null ioengine.

2. You may have only one outstanding command at a time (queue depth 1). You can check this during the test either using iostat on the initiator, or (better) on the SCST target in the /proc/scsi_tgt/sessions and /proc/scsi_tgt/sgv files.

3. The sg engine may be used by fio in indirect mode, i.e. transferring data between user and kernel space with a data copy. This can be checked by looking at fio's sources or using oprofile.
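Checks 1 and 2 could look roughly like this (a sketch only: the job parameters and the /dev/sdX device name are placeholders, and exact option spellings may vary between fio versions):

```shell
# Check 1: measure fio's own per-command overhead. The null ioengine
# completes requests immediately without touching any device, so the
# reported IOPS is an upper bound on what fio itself can drive.
fio --name=null-test --ioengine=null --rw=randread --bs=512 \
    --size=1G --iodepth=32 --runtime=30 --time_based

# Check 2: watch the effective queue depth on the initiator while the
# real test runs; an avgqu-sz close to 1 means only one command is
# outstanding at a time.
iostat -x 1 /dev/sdX

# ...or look at the session state on the SCST target instead:
cat /proc/scsi_tgt/sessions
```

If the null-engine run tops out near the 116K you measured over the wire, the bottleneck is on the initiator side rather than in the SRP/iSER path.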

Vlad
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
