On Thu, May 29, 2014 at 1:42 AM, Matthew W. Ross
<[email protected]> wrote:
>>   For example: If you've got six servers, each with a gigabit link,
>> plugged into your 2510, and then have a single gigabit link connecting
>> your 2510 to your 5308xl, the traffic from those servers could
>> overwhelm the uplink.  The buffer on the 2510 will fill and then start
>> dropping frames.
>
> Is there a term for this kind of buffer overflow on a switch? I want to know
> so I know what to look for if/when this problem comes up again.

  I'm not sure.  Generically, I would call it a "frame buffer
overflow", but that doesn't find nearly as many relevant Google hits
as I would expect.

  (It's difficult to search on, because "overrun" may be used as a
synonym for "overflow", "frame buffer" is also used in a completely
different way for video applications, and "buffer overflow" is also
used for bugs where something doesn't do bounds checking and
reads/writes past the end of a designated memory region.  An overflow
will result in the switch discarding the frame, but so will a
checksum failure and other problems.  Different implementations use
"drops" to mean different things: some drop counters include all
causes, others only some.)

  The overall effect of this kind of failure mode -- more traffic than
capacity -- is generally called "congestion", though.
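  To put rough numbers on that failure mode, here's a back-of-the-envelope
sketch.  The 1 MiB buffer size is an assumed, illustrative figure, not the
2510's actual specification:

```python
# Back-of-the-envelope: how quickly a congested uplink's buffer fills.
# The 1 MiB buffer size is an assumption for illustration only.
offered_bps = 6 * 1_000_000_000   # six servers at 1 Gb/s each
uplink_bps = 1 * 1_000_000_000    # single 1 Gb/s uplink
buffer_bytes = 1 * 1024 * 1024    # assumed shared buffer: 1 MiB

excess_bytes_per_s = (offered_bps - uplink_bps) / 8
fill_time_ms = buffer_bytes / excess_bytes_per_s * 1000
print(f"6:1 oversubscription fills a 1 MiB buffer in {fill_time_ms:.2f} ms")
```

  Even a momentary burst at line rate from all six servers exhausts a
buffer that size in a couple of milliseconds, after which every
additional frame is dropped until the queue drains.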

>> Did you check port statistics on switches and servers?  Check the
>> logs on the switches?  Are all the fault finders enabled on the
>> switches?
>
> Logs were not helpful, and counters did not show a lot of errors. I was not
> looking at any "fault finders," but I will be now.

  "fault finder" is HP's name for things that monitor for trouble
conditions and alert you to them with log entries, and banners in the
web UI.  In the switch CLI, in config context, type "fault" and hit
[TAB], and you'll see your options.  I recommend having them all set
to high initially.  If any become annoying and are known not to be
trouble, you can turn them down individually.
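  For example, the config commands look roughly like the following
(the exact fault types and options vary by model and firmware, so
treat this as a sketch and let [TAB] completion be the authority):

```text
ProCurve(config)# fault-finder bad-cable sensitivity high
ProCurve(config)# fault-finder duplex-mismatch-FDx sensitivity high
```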

> Definition in this case: VMware logs complaining of latency issues while
> communicating with the EqualLogic SAN. I don't remember the exact error, but
> it was up to about .5 seconds in latency. Thus, I figured there was a
> network latency problem.

  0.5 seconds is 500 ms, which is definitely very bad for two hops.
You should typically see <15 ms for a full-size frame.

  If that's actual *network* latency, something very odd is going on.
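  For scale, the wire-level numbers say LAN latency should be
microseconds, not hundreds of milliseconds.  A rough sketch of the
per-hop serialization delay for a full-size Ethernet frame:

```python
# Serialization delay: time to clock one full-size frame onto a 1 Gb/s link.
frame_bytes = 1500                # standard Ethernet MTU
link_bps = 1_000_000_000          # 1 Gb/s

delay_us = frame_bytes * 8 / link_bps * 1_000_000
print(f"~{delay_us:.0f} us per hop")
```

  That's about 12 microseconds per hop.  Switch forwarding and queuing
add some on top, but anything in the hundreds of milliseconds points
at queuing or loss, not the wire itself.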

  I suspect, however, that that figure reflects the effects of
retransmission.  In other words, I think the message means it took
500 ms for the data to get there intact, and the reason it took so
long is that a lot of frames were dropped or corrupted along the way.
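  That explanation is consistent with TCP retransmission timing.
Assuming a Linux-like stack with a ~200 ms minimum retransmission
timeout (an assumption about the initiator's TCP stack, not something
stated here), the stall from losing the same segment repeatedly is:

```python
# Cumulative stall from TCP RTO-driven retransmits with exponential backoff.
# The 200 ms minimum RTO is an assumed (Linux-like) figure.
min_rto_s = 0.2

def stall_after_losses(n_losses):
    # Each consecutive loss of the same segment doubles the wait.
    return sum(min_rto_s * 2 ** i for i in range(n_losses))

print(stall_after_losses(1))  # ~0.2 s -- one drop
print(stall_after_losses(2))  # ~0.6 s -- two consecutive drops
```

  Two consecutive drops of one segment already lands in the
half-second range VMware was reporting, so heavy frame loss on a
congested uplink explains the numbers without any real propagation
latency at all.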

> I expected the HPs to work fine as well. I do not have a valid explanation
> on why they did not suffice while the Dells did.

  Let me ask the obvious question first: Did you use the Dell switches
to replace the HP switches, or did you install them alongside?
Because if you're putting SAN traffic on dedicated switches where
before it was on shared switches, well then of course they'll work
better.  :)

  ProCurve support should be very interested in helping you here.
"Why does a Dell 6224 work fine when a ProCurve 2910 fails miserably?"
is a great title for a trouble ticket.  ;-)

  It may be there is some other limiting factor in HP's products here,
in which case, I think it would benefit everyone to know what it is.

-- Ben
