On Thu, Mar 12, 2009 at 12:43 AM, Matthew Dillon <dil...@apollo.backplane.com> wrote: > > :Hi all, > : > :After making two NICs multiple receive queue working, I now propose to > :add Toeplitz hash function to map packet's CPU. It is mainly use to > :support "receive side scaling" > :(http://www.microsoft.com/whdc/device/network/ndis_rss.mspx) in > :.. > :forwarding path is even CPU localized). If the packet is non-fragment > :TCP, the hash is calculated based on laddr,faddr,lport,fport, else the > :hash is calculated using laddr,faddr. > : > :There are two things we need to overcome: > :1) The result of hash function is non-commutative in the M$ paper, > :i.e. faddr,laddr,fport,lport and laddr,faddr,lport,fport gives > :different result. > :Thanks to corecode's suggestion, as long as 0xabcd is replicated to > :form the key, the result of the hash function _is_ commutative. > : > :2) It is computational heavy > :Thanks to corecode again, we could cache a pre-calculated result > :table, so we actually only need to index a array and OR the results. > :.. > :A simple implementation is at: > :http://leaf.dragonflybsd.org/~sephe/toe.c > : > :I used it to verify that hardware gives the correct result :) > : > :The whole thing is not implemented yet, but if you don't think its bad > :idea, I will move on to implement it. Note, it is not intended to > :replace the current packet hash function. > : > :Best Regards, > :sephe > > Yes, I agree, we definitely need to make this an end-to-end capable > feature, similar to propogating hardware checksum state. My suggestion > is that we add appropriate flags and a hash field in the mbuf header.
Yeah, that's my plan too; we accidentally have 2 unused bytes at the end of m_pkthdr :) > > I don't understand the 0xabcd replication comment w/ regards to making The receive side scaling requires max 40bytes key. According to M$ paper the key content is completely random. Following is the key in M$'s RSS verification suite: 0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2, 0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0, 0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4, 0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c, 0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa The hashing result is not commutative. However, if we pick two random bytes and form the key by replicating them, e.g. the two random byte is 0x5a6d (that's what I mean by 0xabcd :), then the key will be: 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a, 0x6d, 0x5a The hashing result is commutative. I need to note the reason that we only use _2_ random bytes to form the key is that the hash on TCP segment is calculated over a binary string formed by concatenating laddr,faddr,lport,fport, so we are actually limitted by the size of lport/fport which is 2 bytes. While this also reduces the size of the pre-calculated hash result table. > both directions of the connection produce the same hash value, but I > agree that would be desirable. Does it avoid having to recalculate > a hardware-supplied hash code? The hash is only calculated if userland application sends a packet; a simple optimization is to save the hash value in inpcb for TCP socket, so we could avoid calculating the hash for every segments (the hash results are actually same). On the receive side, the hash value is calculate by hardware, and the hardware will deliver it to us in RX desc, so we don't need to calculate it by ourselves. ============= When we dispatch a packet upon RX, there are two conditions that we take into consideration: 1) packet type 2) hash value of the packet As long as the hardware provides them, we even don't need to take a look at the data part of the mbuf (mbuf cluster is always used on RX path, so the data is always in different memory position) 1) shows the "receive side scaling" design flaw, M$ says nothing about it. However, hardware vendors seem to begin to realize it. jme (relative newer chips) provides packet type in RX desc. While old Intel hardware supported by em/emx does not provide that information without split header RX desc. However, it could be infered from the hash type and the RXCSUM results. Their newer hardware (82576) seems to provide the packet type in RX desc. Hmm, this does look like the trend :) Best Regards, sephe -- Live Free or Die