Lines 576-610: Your method of register mapping is rather quaint. You shouldn't need to examine the "reg" property yourself, but should instead use ddi_regs_map_setup() and friends. (You can inquire about the number of register sets and their sizes with ddi_dev_nregs(), ddi_dev_regsize(), etc.) I'd rather see you use the DDI convenience routines here. (Note that config space is set up with pci_config_setup().)
I agree that it is quaint... but... Memory space accesses don't work for the old VT86C100A cards. (They have memory space in their BARs, but nothing happens if you write to them via memory space.)

But you should use a different register number then... and you should still be able to use ddi_regs_map_setup(). Does this not work?

The issue is that the driver needs to know the access type (I/O space or memory space) for a given register set number. It queries the "reg" property to make that association, and then ddi_regs_map_setup() is used with the 'right' register set number.

The webpage states:
"As mentioned in 'IEEE 1275 PCI Binding,' IEEE 1275 PCI binding makes no specific connection between the entries in the "reg" property and the configuration registers in PCI configuration space. Driver developers should check the "reg" property to ensure that the correct rnumber is used in ddi_regs_map_setup(9F)."

So, I read this as "you can't assume that register set number 1 always provides I/O space accesses".
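For reference, a rough sketch of that association -- walk the "reg" property to find the I/O-space entry, then hand the matching rnumber to ddi_regs_map_setup() -- with a made-up helper name, and untested:

#include <sys/types.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>
#include <sys/pci.h>

/*
 * Sketch only: find the register set in the "reg" property that
 * decodes as PCI I/O space and map it with ddi_regs_map_setup().
 * The helper name vr_map_io_regs() is hypothetical.
 */
static int
vr_map_io_regs(dev_info_t *dip, ddi_acc_handle_t *hdlp, caddr_t *addrp,
    ddi_device_acc_attr_t *attrp)
{
        pci_regspec_t   *regs;
        uint_t          nregs, i;
        int             rnumber = -1;

        if (ddi_prop_lookup_int_array(DDI_DEV_T_ANY, dip,
            DDI_PROP_DONTPASS, "reg", (int **)&regs, &nregs) !=
            DDI_PROP_SUCCESS)
                return (DDI_FAILURE);
        nregs /= sizeof (pci_regspec_t) / sizeof (int);

        /* find the register set that decodes as PCI I/O space */
        for (i = 0; i < nregs; i++) {
                if (PCI_REG_ADDR_G(regs[i].pci_phys_hi) ==
                    PCI_REG_ADDR_G(PCI_ADDR_IO)) {
                        rnumber = i;
                        break;
                }
        }
        ddi_prop_free(regs);

        if (rnumber < 0)
                return (DDI_FAILURE);

        /* map the whole register set through the DDI, not by hand */
        return (ddi_regs_map_setup(dip, (uint_t)rnumber, addrp, 0, 0,
            attrp, hdlp));
}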

Lines 717-760: I believe you shouldn't need to use driver.conf... Brussels is the "new" way forward, and I believe that a driver without legacy concerns shouldn't be using driver.conf anymore.

I added it for consistency with the current state of Solaris drivers.
Just tell me again it needs to go and it's gone.

Nuke it. I'm working on a different ethernet driver for a popular commodity chip, and it will *not* have driver.conf settings. (Every friggin' driver always had its own unique tunables, so there was no consistency to follow. Brussels is the way forward.)

OK. I will remove it.

Line 1287: I see you're using desballoc. For a 100M driver, I think this adds complexity that is questionable.
For modern CPUs, the extra complexity of "zero copy" for a 100M device does not outweigh the advantage of an easily maintainable driver.
However, I specifically wrote this driver for use on my 600MHz VIA C3
(http://joostm.nl/solaris/onix/), and on this box, it matters. Every cycle not spent copying data is used for something useful.

But with normal-sized ethernet packets, you may find that the cost of the DMA machinations approaches the cost of doing the copying, especially once you add the locking that you're currently missing.

Even on a 600MHz VIA C3, it shouldn't matter much at 100Mbit.

With smaller packets (e.g. typical for HTTP) the copy will always win. The tradeoff varies from one platform to another, but any "modern" system (i.e. anything made this decade) will probably do better copying packets up to at least 1K.



In particular, you need to make sure that you don't have a race with your driver getting unloaded (e.g. via modunload()) while you still have mblks that are "loaned up". I believe your driver suffers from this race, which could ultimately lead to a panic in certain situations. "Fixing" this requires the driver to use a lock to guard modunload() from completing successfully while there are outstanding mblks from desballoc. (That also means you need a global reference counter.) Again, for a typical 100Mbit driver the performance gain is questionable anyway, so I'd not bother with this and just use bcopy. (Plus bcopy solves a lot of extra problems with resource management...)
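For reference, a minimal sketch of that lock-plus-counter pattern. All names (vr_softc_t, vr_loan_t, sc_loanlock and so on) are made up, and this is untested:

#include <sys/types.h>
#include <sys/stream.h>
#include <sys/ksynch.h>
#include <sys/ddi.h>
#include <sys/sunddi.h>

/*
 * Sketch only: every mblk loaned upstream with desballoc() bumps a
 * per-instance counter under a lock; the free routine drops it;
 * detach(9E) fails while loans are outstanding.
 */
typedef struct vr_softc {
        kmutex_t        sc_loanlock;
        uint32_t        sc_loaned;      /* mblks currently loaned upstream */
        /* ... */
} vr_softc_t;

typedef struct vr_loan {
        frtn_t          vl_frtn;        /* free routine for desballoc() */
        vr_softc_t      *vl_sc;         /* back pointer to the instance */
        /* plus the DMA handle / buffer this loan covers */
} vr_loan_t;

static void
vr_loan_free(void *arg)         /* called when the stack frees the mblk */
{
        vr_loan_t       *loan = arg;
        vr_softc_t      *sc = loan->vl_sc;

        /* recycle the DMA buffer back onto the receive ring here ... */

        mutex_enter(&sc->sc_loanlock);
        sc->sc_loaned--;
        mutex_exit(&sc->sc_loanlock);
}

static mblk_t *
vr_loan_up(vr_softc_t *sc, vr_loan_t *loan, uchar_t *buf, size_t len)
{
        mblk_t  *mp;

        loan->vl_sc = sc;
        loan->vl_frtn.free_func = vr_loan_free;
        loan->vl_frtn.free_arg = (caddr_t)loan;

        mp = desballoc(buf, len, BPRI_MED, &loan->vl_frtn);
        if (mp != NULL) {
                mutex_enter(&sc->sc_loanlock);
                sc->sc_loaned++;
                mutex_exit(&sc->sc_loanlock);
        }
        return (mp);
}

/*
 * In detach(9E), after receive has been quiesced (so no new loans can
 * appear), check the counter under the same lock the free routine takes.
 */
static int
vr_detach_ok(vr_softc_t *sc)
{
        int     busy;

        mutex_enter(&sc->sc_loanlock);
        busy = (sc->sc_loaned != 0);
        mutex_exit(&sc->sc_loanlock);

        return (busy ? DDI_FAILURE : DDI_SUCCESS);
}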
This is in detach:
418 if (vrp->dmamem > 0) {
419  vr_log(vrp, CE_NOTE, "Can't detach, buffers in upper layer.");
420  return (DDI_FAILURE);
421 }

vrp->dmamem is the amount of memory allocated for DMA buffers for that instance. Is this sufficient?

No, you need to guard against race conditions.  That implies locking.

This area is really tricky to get right, and almost every driver that has ever tried to do this has gotten it WRONG.

The extra overhead this implies is painful. I'd *really* recommend you consider just using bcopy.

Which driver does it right, then? A correct example would be useful.

This would solve the problem described at line 1377 as well, because you can then bcopy into a misaligned buffer, avoiding an extra data copy in the IP stack later. If you don't change this otherwise, you must add safeguards against the modunload race, and you *should* conditionalize it on the length of the packet. (Tiny packets will get dismal perf. on this path.)
The TX path is changed so that only packets of 128 bytes and above are mapped.

I suspect that 128 needs to be more like 512 or even 1024.
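Tying the alignment fix and the size threshold together, a receive-side sketch, continuing the made-up names from the earlier sketch (the threshold value is a placeholder in the 512-1024 range discussed above):

/*
 * Sketch only: frames at or below a copy threshold are bcopy'd into a
 * fresh mblk, offset by 2 bytes so the IP header behind the 14-byte
 * Ethernet header ends up 4-byte aligned; larger frames are loaned
 * upstream via vr_loan_up() as sketched earlier.
 */
#define VR_RX_COPY_THRESH       512     /* bytes; tune per platform */
#define VR_IP_ALIGN             2       /* pad so the IP header is aligned */

static mblk_t *
vr_receive_frame(vr_softc_t *sc, vr_loan_t *loan, uchar_t *buf, size_t len)
{
        mblk_t  *mp;

        if (len <= VR_RX_COPY_THRESH) {
                /* copy path: no loan bookkeeping, aligned for IP */
                if ((mp = allocb(len + VR_IP_ALIGN, BPRI_MED)) == NULL)
                        return (NULL);
                mp->b_rptr += VR_IP_ALIGN;
                bcopy(buf, mp->b_rptr, len);
                mp->b_wptr = mp->b_rptr + len;
                return (mp);
        }

        /* loan path: desballoc() with the reference counting above */
        return (vr_loan_up(sc, loan, buf, len));
}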


Line 1786: Similar to the rx comments above, this approach, which is common but IMO ugly, is better eliminated in favor of bcopying the frame into prealloc'd buffers. DMA setup and teardown in particular is expensive, and this actually *hurts* performance worse than just using bcopy on *many* platforms... especially if the frames are not full MTU frames. I think I wouldn't bother with this extra effort, but go with the simple bcopy approach. You've got to do that anyway if the packet doesn't fit your alignment or block count requirements, so it's simpler to just skip all the tests and do the copy unconditionally, IMO. (For larger frames -- say more than 512 bytes -- the tradeoff may favor DMA setup and teardown, but it's pretty close, I think. And, for 100Mbps with reasonably recent CPUs -- anything in the last decade or so -- you shouldn't need the special DMA tricks to get line-speed performance.) Note that unlike rx, you can't even get the benefit of reusing mappings... so the tradeoff here between bcopy and DMA much more strongly favors bcopy, I think.
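A sketch of that unconditional-copy transmit path, again with made-up names (vr_txbuf_t, vr_tx_get_buf(), vr_tx_queue_desc(), VR_MAX_FRAME); the buffers are assumed to have been allocated and DMA-bound once at attach time:

#include <sys/strsun.h>         /* msgsize(), MBLKL */

/*
 * Sketch only: the mblk chain is flattened into a transmit buffer that
 * was bound at attach time, so there is no per-packet
 * ddi_dma_addr_bind_handle()/ddi_dma_unbind_handle() pair.
 */
typedef struct vr_txbuf {
        caddr_t          tb_kaddr;      /* kernel address of bound buffer */
        ddi_dma_handle_t tb_dmah;       /* handle bound at attach time */
} vr_txbuf_t;

extern vr_txbuf_t *vr_tx_get_buf(vr_softc_t *);         /* hypothetical */
extern void vr_tx_queue_desc(vr_softc_t *, vr_txbuf_t *, size_t);

static boolean_t
vr_tx_copy(vr_softc_t *sc, mblk_t *mp)
{
        vr_txbuf_t      *txb;
        mblk_t          *bp;
        size_t          len = msgsize(mp);
        size_t          off = 0;

        if (len > VR_MAX_FRAME)
                return (B_FALSE);               /* caller drops the packet */

        if ((txb = vr_tx_get_buf(sc)) == NULL)
                return (B_FALSE);               /* ring full; retry later */

        /* flatten the (possibly fragmented) message into the bound buffer */
        for (bp = mp; bp != NULL; bp = bp->b_cont) {
                size_t  frag = MBLKL(bp);

                bcopy(bp->b_rptr, txb->tb_kaddr + off, frag);
                off += frag;
        }

        (void) ddi_dma_sync(txb->tb_dmah, 0, len, DDI_DMA_SYNC_FORDEV);
        vr_tx_queue_desc(sc, txb, len);         /* fill descriptor, kick chip */
        freemsg(mp);
        return (B_TRUE);
}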
Again, I agree. True for modern CPUs. On slower CPUs, things are different. The driver's objective is not to get wire-speed TCP throughput. It's more like "get decent throughput in as few cycles as possible". TX DMA mapping supports that objective. Every cycle not spent in the driver is used elsewhere.

See my earlier comments. If you want to do this, go ahead, but I think you're mistaken. You should do some performance analysis to see if you're really saving CPU cycles. I suspect you might not be saving as much as you think. (It will vary by packet size, as well as the specific CPU and MMU configuration, of course.)

OK. I will re-evaluate copy versus DMA mapping on the VIA C3 box, measured using NFS. Is your objection to DMA mappings on the TX path, the RX path, or both?



You still need to handle the DMA binding properly, using locking. The locking will sap performance somewhat, which is one of the reasons that I so strongly advocate just copying.

In fact, I'm probably going to do this in the Gigabit driver I'm working on.

   -- Garrett

Thanks for caring!

Joost
--
Joost Mulders             +  email: [email protected]
Technical Consultant      +  phone: +31-33-45-15701
Client Solutions          +    fax: +31-33-45-15734
Sun Microsystems          + mobile: +31-6-5198-7268
-= Anything not done right, has to be done again =-
