I suspect that the difference in time to access unaligned packed data
vs. aligned data is irrelevant when compared with the general
complexity of running a modern host adaptor.
A more rational layout would be

	typedef struct {
		uchar l[4];
		uchar m[4];
		char f;
		char g;
	} Dac960fu;

which takes 12 bytes (including padding at the end) vs. 10 for the
unaligned packed version. Since PCI buses transfer whole 32-bit longs,
it takes 3 transfers in either case.

If you've got 46 integers of variable size, it makes sense to sort
them by size, largest first, at least as a first cut, to minimise
space wasted by padding. You can also sometimes pack more tightly
than a simple sort would suggest (e.g., if you have shorts and chars).
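
To make the arithmetic concrete, here is a rough sketch in C. It is not the driver's actual declaration: the Sorted and Unsorted types, and the transfers and paddingsaved helpers, are made up for illustration. It assumes l and m are declared as 4-byte integers (which is what the 12-byte figure implies) and that the compiler aligns them on 4-byte boundaries, as most common ABIs do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of the layout with l and m as 4-byte integers. */
typedef struct {
	uint32_t l;
	uint32_t m;
	uint8_t f;
	uint8_t g;	/* typically followed by 2 bytes of tail padding */
} Sorted;

/* Same fields interleaved, so each char is typically followed by padding. */
typedef struct {
	uint8_t f;
	uint32_t l;
	uint8_t g;
	uint32_t m;
} Unsorted;

/* Number of whole 32-bit bus transfers needed to move nbytes. */
size_t
transfers(size_t nbytes)
{
	return (nbytes + 3) / 4;
}

/* Bytes of padding saved by declaring the largest fields first. */
size_t
paddingsaved(void)
{
	assert(sizeof(Unsorted) >= sizeof(Sorted));
	return sizeof(Unsorted) - sizeof(Sorted);
}
```

Under those assumptions sizeof(Sorted) should come out as 12 and sizeof(Unsorted) as 16, and transfers() gives 3 for both the 10-byte packed size and the 12-byte padded size. Other ABIs may pad differently, so check sizeof and offsetof on the compiler you actually use.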

It sounds like the folks who designed the dac960 either didn't think
much about how drivers would access it, or they were hog wild over
gcc's packed data attribute (does Microsoft's compiler have something
similar?).