On Mon, Jul 26, 2010 at 1:27 AM, Garrett D'Amore <garr...@nexenta.com> wrote:
> On Sun, 2010-07-25 at 21:39 -0500, Mike Gerdts wrote:
>> On Sun, Jul 25, 2010 at 8:50 PM, Garrett D'Amore <garr...@nexenta.com> wrote:
>> > On Sun, 2010-07-25 at 17:53 -0400, Saxon, Will wrote:
>> >>
>> >> I think there may be very good reason to use iSCSI, if you're limited
>> >> to gigabit but need to be able to handle higher throughput for a
>> >> single client. I may be wrong, but I believe iSCSI to/from a single
>> >> initiator can take advantage of multiple links in an active-active
>> >> multipath scenario whereas NFS is only going to be able to take
>> >> advantage of 1 link (at least until pNFS).
>> >
>> > There are other ways to get multiple paths.  First off, there is IP
>> > multipathing, which offers some of this at the IP layer.  There is also
>> > 802.3ad link aggregation (trunking).  So you can still get high
>> > performance beyond a single link with NFS.  (It works with iSCSI too,
>> > btw.)
>>
>> With both IPMP and link aggregation, each TCP session will go over the
>> same wire.  There is no guarantee that load will be evenly balanced
>> between links when there are multiple TCP sessions.  As such, any
>> scalability you get using these configurations will be dependent on
>> having a complex enough workload, wise configuration choices, and
>> a bit of luck.
>
> If you're really that concerned, you could use UDP instead of TCP.  But
> that may have other detrimental performance impacts, I'm not sure how
> bad they would be in a data center with generally lossless ethernet
> links.
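
Whichever transport you pick, the usual L2/L3/L4 hash policies pin a
given flow to a single member link; only round robin spreads one flow
across links, and that has its own problems (below).  A toy sketch of
the idea -- purely illustrative, not the actual IPMP or 802.3ad code,
and the addresses and ports are made up:

import zlib

def pick_link(src_ip, src_port, dst_ip, dst_port, nlinks):
    # Hash the flow's 4-tuple; the same flow always maps to the same link,
    # so one NFS TCP session can never exceed one wire's worth of bandwidth.
    flow = "%s:%d->%s:%d" % (src_ip, src_port, dst_ip, dst_port)
    return zlib.crc32(flow.encode()) % nlinks

print(pick_link("10.0.0.5", 1023, "10.0.0.9", 2049, 4))  # same index
print(pick_link("10.0.0.5", 1023, "10.0.0.9", 2049, 4))  # every time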

Heh.  My horror story with reassembly was actually with connectionless
transports (LLT, then UDP).  Oracle RAC's cache fusion sends 8 KB
blocks via UDP by default, or LLT when used in the Veritas + Oracle
RAC certified configuration from 5+ years ago.  Because we used Sun
Trunking with round-robin hashing and had not enabled jumbo frames,
every cache fusion block turned into 6 LLT or UDP packets that had to
be reassembled on the other end.  This was on a 15K domain with the
NICs spread across IO boards.  I assume that interrupts for a NIC are
handled by a CPU on the closest system board (Solaris 8, FWIW).  If
that assumption is true then there would also be a flurry of
inter-system board chatter to put the block back together.  In any
case, performance was horrible until we got rid of round robin and
enabled jumbo frames.
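
The arithmetic is straightforward (rough numbers; I'm ignoring the
exact LLT framing and just using plain IPv4/UDP overhead):

import math

def fragments(block_bytes, mtu, overhead=28):
    # Rough fragment count for one UDP datagram: 20 bytes IPv4 + 8 bytes UDP
    # overhead per frame (simplified; real IP fragmentation differs slightly).
    return math.ceil(block_bytes / float(mtu - overhead))

print(fragments(8192, 1500))  # 6 -> every 8 KB cache fusion block = 6 packets
print(fragments(8192, 9000))  # 1 -> jumbo frames make the reassembly go away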

> Btw, I am not certain that the multiple initiator support (mpxio) is
> necessarily any better as far as guaranteed performance/balancing.  (It
> may be; I've not looked closely enough at it.)

I haven't paid close attention to how mpxio works.  The Veritas
analog, vxdmp, does a very good job of balancing traffic down multiple
paths, even when only a single LUN is accessed.  The exact mode that
dmp will use is dependent on the capabilities of the array it is
talking to - many arrays work in an active/passive mode.  As such, I
would expect that with vxdmp or mpxio the balancing with iSCSI would
be at least partially dependent on what the array said to do.
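
To be clear about what I mean by "what the array said to do", here is
a purely illustrative sketch -- not vxdmp or mpxio code, and the mode
names are just labels:

def choose_path(paths, array_mode, i):
    # Active/active arrays allow spreading I/O across every path;
    # active/passive arrays force all I/O down the currently active path.
    if array_mode == "active/active":
        return paths[i % len(paths)]
    return paths[0]

paths = ["iscsi-path-0", "iscsi-path-1"]
print([choose_path(paths, "active/active", i) for i in range(4)])
print([choose_path(paths, "active/passive", i) for i in range(4)])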

> I should look more closely at NFS as well -- if multiple applications on
> the same client are accessing the same filesystem, do they use a single
> common TCP session, or can they each have separate instances open?
> Again, I'm not sure.

It's worse than that.  A quick experiment with two different
automounted home directories from the same NFS server suggests that
both home directories share one TCP session to the NFS server.
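
The experiment was nothing fancy; it amounted to roughly this (the
mount points are made-up examples, and the netstat output format
varies by platform):

import os, subprocess

# Poke two automounted home directories to generate NFS traffic, then count
# established TCP sessions to the NFS port (2049).
for home in ("/home/usera", "/home/userb"):
    os.listdir(home)

out = subprocess.run(["netstat", "-n"], capture_output=True, text=True).stdout
nfs = [line for line in out.splitlines()
       if ".2049" in line and "ESTABLISHED" in line]
print(len(nfs))  # I see 1: both home directories share a single session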

The latest version of Oracle's RDBMS supports a userland NFS client
option.  It would be very interesting to see if this does a separate
session per data file, possibly allowing for better load spreading.

>> Note that with Sun Trunking there was an option to load balance using
>> a round robin hashing algorithm.  When pushing high network loads this
>> may cause performance problems with reassembly.
>
> Yes.  Reassembly is Evil for TCP performance.
>
> Btw, the iSCSI balancing act that was described does seem a bit
> contrived -- a single initiator and a COMSTAR server, both client *and
> server* with multiple ethernet links instead of a single 10GbE link.
>
> I'm not saying it doesn't happen, but I think it happens infrequently
> enough that its reasonable that this scenario wasn't one that popped
> immediately into my head. :-)

It depends on whether the people who control the network gear are the
same ones who control the servers.  My experience suggests that when
there is a disconnect, each group's standardization efforts,
procurement cycles, and capacity plans will work against any attempt
to have an optimal configuration.

Also, it is rather common to have multiple 1 Gb links to servers going
to disparate switches so as to provide resilience in the face of
switch failures.  This is not unlike (at a block diagram level) the
architecture that you see in pretty much every SAN.  In such a
configuration, it is reasonable for people to expect that load
balancing will occur.

-- 
Mike Gerdts
http://mgerdts.blogspot.com/