Nicolas Droux wrote:
>
> On May 14, 2009, at 10:51 AM, Andrew Gallatin wrote:
>
>> Nitin Hande wrote:
>>> Andrew Gallatin wrote:
>>
>>>> When looking at this, I noticed mac_tx_serializer_mode(). Am I
>>>> reading this right, in that it serializes a single queue? That
>>>> seems lacking, compared to the nxge_serialize stuff it replaces.
>>> Yes. This part was done for nxge, and as far as I remember the
>>> recent performance of this scheme was very close to that of the
>>> previous scheme. I think Gopi can comment more on this. What part
>>> do you think is missing here?
>>
>> Perhaps I'm missing something... Doesn't nxge support multiple TX
>> rings? If so, does the existing serialization serialize all traffic
>> to a single ring, or is mac_tx_serializer_mode() applied after
>> mac_tx_fanout_mode()?
>>
>> I had thought the original nxge serializer serialized each TX ring
>> separately in nxge. The fork I made of it for myri10ge certainly
>> works that way.
>
> The serializer is only for use by the nxge driver, which has an
> inefficient TX path locking implementation. We didn't have the
> resources to completely rewrite the nxge transmit path as part of the
> Crossbow project, so we moved the serialization implementation into
> MAC for that driver. The serializer in MAC does serialization on a
> per-ring basis. The serializer should not be used by any other driver.
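
(As an aside, here is a minimal, compilable user-space C sketch of the
per-ring serializer idea described above. The names (tx_ring_t,
serial_tx, hw_ring_send) are invented for illustration, and this is not
the actual mac_tx_serializer_mode() or nxge_serialize code; it only
shows the "one drainer per ring" pattern.)

/*
 * Minimal user-space model of a per-ring TX serializer (hypothetical
 * names, not the actual MAC or nxge code).  Callers enqueue packets on
 * the ring's queue; at most one thread at a time becomes the "drainer"
 * that hands packets to the hardware, so the driver's (possibly
 * lock-heavy) send routine is never entered concurrently for one ring.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct pkt {
	struct pkt	*next;
	/* payload omitted */
} pkt_t;

typedef struct tx_ring {
	pthread_mutex_t	lock;		/* initialized at ring setup */
	pkt_t		*head, *tail;	/* queued, not yet sent */
	bool		draining;	/* a drainer is currently active */
} tx_ring_t;

/* Stand-in for the driver's real per-ring send routine. */
static void
hw_ring_send(tx_ring_t *ring, pkt_t *p)
{
	(void) ring;
	(void) p;	/* would write TX descriptors here */
}

void
serial_tx(tx_ring_t *ring, pkt_t *p)
{
	pthread_mutex_lock(&ring->lock);
	p->next = NULL;
	if (ring->tail != NULL)
		ring->tail->next = p;
	else
		ring->head = p;
	ring->tail = p;

	if (ring->draining) {
		/* The active drainer will send our packet; just return. */
		pthread_mutex_unlock(&ring->lock);
		return;
	}
	ring->draining = true;

	/* Drain the queue, dropping the lock around the hardware call. */
	while (ring->head != NULL) {
		pkt_t *q = ring->head;
		ring->head = q->next;
		if (ring->head == NULL)
			ring->tail = NULL;
		pthread_mutex_unlock(&ring->lock);
		hw_ring_send(ring, q);
		pthread_mutex_lock(&ring->lock);
	}
	ring->draining = false;
	pthread_mutex_unlock(&ring->lock);
}

In this sketch each ring serializes independently: a thread submitting
to ring A never waits behind a drain of ring B, only on the short queue
lock of its own ring.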
krgopi said, in an earlier reply, "mac_tx_serializer_mode() is used
when you have a single Tx ring. nxge would not use that mode". So I'm
confused. From the source, it looks like nxge is using that mode
(MAC_VIRT_SERIALIZE |'ed into mi_v12n_level). So I guess it is
restricted to using only one of its hw tx rings, then?

> You don't have to use the serializer to support multiple TX rings.
> Keep your TX path lean and mean, apply good design principles, e.g.
> avoid holding locks for too long on your data-path, and you should
> be fine.

FWIW, my tx path is very "lean and mean". The only times locks are held
are when writing the tx descriptors to the NIC, and when allocating a
dma handle from a pre-allocated per-ring pool (a rough sketch follows
at the end of this message).

I thought the serializer was silly too, but PAE claimed a speedup from
it. I think PAE claimed the speedup came from never back-pressuring the
stack when the host overran the NIC: one of the "features" of the
serializer was to always block the calling thread if the tx queue was
exhausted.

Have you done any packets-per-second benchmarks with your fanout code?
I'm concerned that it is very cache-unfriendly if you have a long run
of packets all going to the same destination, because it walks the
mblk chain, reading the packet headers, and queues up a big chain. If
the chain gets too long, the mblks and/or the packet headers will have
been pushed out of cache by the time they reach the driver's xmit
routine, so in that case you could see twice as many cache misses as
normal when things get really backed up.

Last, you (or somebody) mentioned there was interest in adding a hook
for a driver to do fanout. Is there a bugid or something for this?

Drew
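
P.S. A rough, compilable user-space sketch of the per-ring TX path
described above. The names (tx_ring_t, ring_xmit, write_tx_descriptor,
dma_handle_t) are invented and descriptor-full handling is left out, so
treat it as an illustration of where the locks sit rather than as the
actual myri10ge code.

/*
 * Rough user-space sketch of a "lean" per-ring TX path (invented names,
 * not the myri10ge code).  Locks are held only briefly: the pool lock
 * while a pre-allocated DMA handle is taken from the per-ring free
 * list, and the ring lock while a descriptor slot is claimed and
 * written.  Descriptor-full accounting is omitted for brevity.
 */
#include <pthread.h>
#include <stddef.h>

typedef struct dma_handle {
	struct dma_handle *next;
	/* mapping state omitted */
} dma_handle_t;

typedef struct pkt {
	size_t	len;
	void	*data;
} pkt_t;

typedef struct tx_ring {
	pthread_mutex_t	ring_lock;	/* protects the descriptor ring */
	pthread_mutex_t	pool_lock;	/* protects the handle free list */
	dma_handle_t	*free_handles;	/* pre-allocated at attach time */
	unsigned int	head;		/* next descriptor slot */
	unsigned int	mask;		/* ring size - 1 */
} tx_ring_t;

/* Stand-in for building and writing the hardware descriptor(s). */
static void
write_tx_descriptor(tx_ring_t *r, unsigned int slot, dma_handle_t *h,
    pkt_t *p)
{
	(void) r; (void) slot; (void) h; (void) p;
}

int
ring_xmit(tx_ring_t *r, pkt_t *p)
{
	dma_handle_t *h;
	unsigned int slot;

	/* Take a pre-allocated handle; back-pressure if none are left. */
	pthread_mutex_lock(&r->pool_lock);
	h = r->free_handles;
	if (h == NULL) {
		pthread_mutex_unlock(&r->pool_lock);
		return (-1);	/* caller re-queues or stops the ring */
	}
	r->free_handles = h->next;
	pthread_mutex_unlock(&r->pool_lock);

	/* DMA binding of p would happen here, outside any lock. */

	/* Hold the ring lock only while claiming and writing the slot. */
	pthread_mutex_lock(&r->ring_lock);
	slot = r->head++ & r->mask;
	write_tx_descriptor(r, slot, h, p);
	pthread_mutex_unlock(&r->ring_lock);

	return (0);
}

Returning -1 when the handle pool is empty is where back-pressure would
start; blocking the caller at that point instead is essentially what the
serializer "feature" described above does.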
