On Mon, Jul 26, 2010 at 1:27 AM, Garrett D'Amore <garr...@nexenta.com> wrote:
> On Sun, 2010-07-25 at 21:39 -0500, Mike Gerdts wrote:
>> On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore <garr...@nexenta.com> wrote:
>> > On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote:
>> >> I think there may be very good reason to use iSCSI if you're limited
>> >> to gigabit but need to handle higher throughput for a single client.
>> >> I may be wrong, but I believe iSCSI to/from a single initiator can
>> >> take advantage of multiple links in an active-active multipath
>> >> scenario, whereas NFS is only going to be able to use one link (at
>> >> least until pNFS).
>> >
>> > There are other ways to get multiple paths. First off, there is IP
>> > multipathing (IPMP), which offers some of this at the IP layer. There
>> > is also 802.3ad link aggregation (trunking). So you can still get
>> > performance beyond a single link with NFS. (It works with iSCSI too,
>> > btw.)
>>
>> With both IPMP and link aggregation, each TCP session will go over the
>> same wire. There is no guarantee that load will be evenly balanced
>> between links when there are multiple TCP sessions. As such, any
>> scalability you get from these configurations will depend on having a
>> complex enough workload, wise configuration choices, and a bit of luck.
>
> If you're really that concerned, you could use UDP instead of TCP. But
> that may have other detrimental performance impacts; I'm not sure how
> bad they would be in a data center with generally lossless ethernet
> links.
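To make the one-session-per-wire point concrete, here is a minimal sketch of
why link aggregation cannot spread a single TCP session: the switch hashes
the flow's headers to pick one member link, so every packet of that session
lands on the same wire. The hash below is purely illustrative; real switches
and the Solaris dladm aggregation use vendor-specific L2/L3/L4 hash policies.

```python
import hashlib

def pick_link(src_ip, src_port, dst_ip, dst_port, n_links):
    """Map a TCP 4-tuple to one member link of an aggregation.
    Illustrative hash only -- not any vendor's actual algorithm."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % n_links

# One NFS client talking to one server over a 4-link aggregation:
# a single session, so a single link, no matter how busy it gets.
link = pick_link("10.0.0.5", 40001, "10.0.0.9", 2049, 4)
assert all(
    pick_link("10.0.0.5", 40001, "10.0.0.9", 2049, 4) == link
    for _ in range(100)
)  # every packet of the session hashes identically -- no spreading
```

Spreading only happens across *different* flows, which is why the scalability
depends on having many sessions and on how well they happen to hash.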
Heh. My horror story with reassembly was actually with connectionless
transports (LLT, then UDP). Oracle RAC's cache fusion sends 8 KB blocks
via UDP by default, or via LLT in the Veritas + Oracle RAC certified
configuration from 5+ years ago. The combination of Sun Trunking with
round-robin hashing and the lack of jumbo frames made every cache fusion
block turn into 6 LLT or UDP packets that had to be reassembled on the
other end. This was on a 15K domain with the NICs spread across I/O
boards. I assume that interrupts for a NIC are handled by a CPU on the
closest system board (Solaris 8, FWIW). If that assumption is true,
there would also be a flurry of inter-system-board chatter to put each
block back together. In any case, performance was horrible until we got
rid of round robin and enabled jumbo frames.

> Btw, I am not certain that the multiple initiator support (mpxio) is
> necessarily any better as far as guaranteed performance/balancing. (It
> may be; I've not looked closely enough at it.)

I haven't paid close attention to how mpxio works. The Veritas analog,
vxdmp, does a very good job of balancing traffic across multiple paths,
even when only a single LUN is accessed. The exact mode that dmp uses
depends on the capabilities of the array it is talking to; many arrays
work in an active/passive mode. As such, I would expect that with vxdmp
or mpxio the balancing with iSCSI would be at least partially dependent
on what the array said to do.

> I should look more closely at NFS as well -- if multiple applications on
> the same client are accessing the same filesystem, do they use a single
> common TCP session, or can they each have separate instances open?
> Again, I'm not sure.

It's worse than that. A quick experiment with two different automounted
home directories from the same NFS server suggests that both home
directories share one TCP session to the NFS server. The latest version
of Oracle's RDBMS supports a userland NFS client option.
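The six-packet figure in the story above is easy to check with
back-of-the-envelope arithmetic. A quick sketch (the 20-byte IP header and
8-byte UDP header are the standard sizes; non-final fragment payloads must
be multiples of 8 bytes per the IP fragmentation rules):

```python
import math

def udp_fragments(datagram_bytes, mtu, ip_header=20, udp_header=8):
    """Number of IP fragments needed to carry one UDP datagram."""
    total = datagram_bytes + udp_header      # UDP header rides in fragment 1
    per_frag = (mtu - ip_header) // 8 * 8    # payload per fragment, 8-byte aligned
    return math.ceil(total / per_frag)

print(udp_fragments(8192, 1500))  # 6 fragments at standard MTU
print(udp_fragments(8192, 9000))  # 1 fragment with jumbo frames
```

With round-robin hashing those 6 fragments also arrive on different links,
possibly out of order, which is what made the reassembly so painful.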
It would be very interesting to see if this does a separate session per
data file, possibly allowing for better load spreading.

>> Note that with Sun Trunking there was an option to load balance using
>> a round-robin hashing algorithm. When pushing high network loads this
>> may cause performance problems with reassembly.
>
> Yes. Reassembly is Evil for TCP performance.
>
> Btw, the iSCSI balancing act that was described does seem a bit
> contrived -- a single initiator and a COMSTAR server, both client *and
> server* with multiple ethernet links instead of a single 10GbE link.
>
> I'm not saying it doesn't happen, but I think it happens infrequently
> enough that it's reasonable that this scenario wasn't one that popped
> immediately into my head. :-)

It depends on whether the people who control the network gear are the
same ones who control the servers. My experience suggests that if there
is a disconnect, each group's standardization efforts, procurement
cycles, and capacity plans are likely to work against any attempt at an
optimal configuration.

Also, it is rather common to have multiple 1 Gb links to servers going
to disparate switches so as to provide resilience in the face of switch
failures. This is not unlike (at a block diagram level) the architecture
you see in pretty much every SAN. In such a configuration, it is
reasonable for people to expect that load balancing will occur.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss