Hello,

>>
>>...
>> If I get it right, round about 6% (7.38% * 84.56%) of the time the
>> machine does a memcpy inside __pskb_pull_tail. The comment on this
>> function reads "... it expands header moving its tail forward and
>> copying necessary data from fragmented part. ... It is pretty
>> complicated. Luckily, it is called only in exceptional cases ...".
>> That does not sound good at all. I repeated the test on a normal
>> Intel gigabit network without jumbo frames and __pskb_pull_tail was
>> not in the top consumer list.
>
>> Does anyone have an idea if this is normal GRO behaviour for IPoIB?
>> At the moment I have a full test environment and could implement
>> and verify some kernel corrections if someone could give a helpful
>> hint.
>
>As always, it would be good and helpful if you can re-run the test
>with the latest upstream kernel, e.g. 3.9-rc, and anyway, I added Eric
>who might have some insight on the matter.
>
>Or.


Going through some hard lessons in understanding SKBs, I think I have
finally found the reason for the unnecessary memcpy operations. Even
with the newest 3.9-rc5 kernel the problem persists: IPoIB creates
only fragmented SKBs, without a single byte in the linear data part.
Some debug messages during GRO handling showed:

skb->len         = 1988  (total data)
skb->data_len    = 1988  (paged data)
skb_headlen(skb) = 0     (non-paged data)
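
The numbers above came from nothing fancier than a quick printk in the
GRO receive path, roughly along these lines (debug only, of course not
part of any patch):

pr_info("GRO rx: len=%u data_len=%u headlen=%u\n",
        skb->len, skb->data_len, skb_headlen(skb));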

inet_gro_receive() needs the IP header in the linear part of the SKB,
so it has to pull the missing data out of the fragments, and that
pull is what causes the extra memcpy operations.
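
As far as I can follow the code, inet_gro_receive() fetches the header
via skb_gro_header_slow(), which ends up in pskb_may_pull(). Quoting
that helper roughly from memory (include/linux/skbuff.h), so please
double-check the details:

static inline int pskb_may_pull(struct sk_buff *skb, unsigned int len)
{
        if (likely(len <= skb_headlen(skb)))
                return 1;       /* header already linear, nothing to copy */
        if (unlikely(len > skb->len))
                return 0;
        /* header sits in the fragments: copy it into the linear area */
        return __pskb_pull_tail(skb, len - skb_headlen(skb)) != NULL;
}

With skb_headlen() == 0 the fast path at the top is never taken, so
__pskb_pull_tail() and its memcpy run for every single packet.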

It all comes down to ipoib_ud_need_sg(), which decides whether a
receive buffer fits into a single page. Its one and only parameter is
always the device's max_ib_mtu, not the MTU that is actually
configured. On my ConnectX card max_ib_mtu defaults to 4K no matter
what MTU is really set, so IPoIB always puts the incoming data into a
separate SKB fragment.
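
For reference, this is roughly how the check looks in
drivers/infiniband/ulp/ipoib/ipoib.h in 3.9-rc (again quoted from
memory, so the exact lines may differ):

#define IPOIB_UD_HEAD_SIZE        (IB_GRH_BYTES + IPOIB_ENCAP_LEN)
#define IPOIB_UD_BUF_SIZE(ib_mtu) ((ib_mtu) + IB_GRH_BYTES)

static inline int ipoib_ud_need_sg(unsigned int size)
{
        return IPOIB_UD_BUF_SIZE(size) > PAGE_SIZE;
}

With max_ib_mtu = 4096 the buffer size is 4096 + 40 and therefore
always larger than a 4K page, so the SG path is taken unconditionally.
The linear part then only receives the GRH plus the 4-byte IPoIB
header, and once those are stripped nothing of the IP packet is left
in the head, which matches the skb_headlen(skb) = 0 above.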

My old but nicely working switch only allows an MTU of 2044 bytes, so
I assumed I do not need to care about fragments at all and hardcoded
priv->max_ib_mtu to 3072 (the exact hack is shown after the results
below); a page is still large enough for that MTU. A quick test
afterwards, with no claim of being thorough, showed the expected
effects:

1) no more additional memcpy operations
2) netperf throughput rose from ~5.3 Gbit/s to ~5.8 Gbit/s
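
The change itself was just a one-liner in ipoib_main.c, where
max_ib_mtu is taken from the port attributes (from memory, and
obviously a hack; a real patch would have to derive the value from the
configured MTU instead of hardcoding it):

-       priv->max_ib_mtu = ib_mtu_enum_to_int(attr.max_mtu);
+       priv->max_ib_mtu = 3072;        /* hack: my switch tops out at 2044 */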

I hope I am not totally wrong with this finding and that my simple
explanation is conclusive. Maybe someone with more knowledge of this
area can assist me in getting an official patch into the RDMA
development tree?

Thanks in advance.

Markus


