Daniel Ouellet wrote:
I am not saying it's the same problem here, but it sure behaves the
exact same way. See if you have timeout in the logs or not from that
hme driver. But without you doing more tests on your box, it will not
be looked at before it's done for sure.
I really hope it helps you nevertheless and gives you some ideas to
try. The best way to get help is to help yourself first and really try
many things and then you have more valuable data to use and report with.
Thanks for the suggestions. I have tried quite a few things so far, and
will continue to test. Since this is a production environment, I can
only work on it off-hours. The problem is intermittent and not always
easy to reproduce, so it will take some time. Last night I got it to
hang several times, but later the same load tests running for hours were
not triggering it. The host and switch are forced to 100Mbps/Full and
I've installed new cables. I tried several other switches and ports,
and no matter what I do, even if it doesn't hang I always get interface
errors. With gem it was ierrs, with hme it was oerrs, and I even tried
an fxp card from a Compaq server and with that I get both ierrs and
oerrs, but only a fraction of a percent of total packets. Today so far
I have 160 errors out of 30M packets along with some input and CRC
errors on the Cisco switch port. I would think these counters should be
0 unless something is really wrong. But throughput seems fine.
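For scale, here is a quick check of what those figures amount to. It is a minimal sketch in Python that uses only the two numbers quoted above (160 errors, roughly 30M packets); the variable names are just for illustration.

```python
# Figures quoted above: 160 errors out of roughly 30M packets.
errors = 160
packets = 30_000_000

# Fraction of packets that were counted as errors.
rate = errors / packets

print(f"error rate: {rate:.2e} of packets")
print(f"about one error per {packets // errors:,} packets")
```

That is about one error per 187,500 packets: tiny as a percentage, but still not the zero expected on a clean full-duplex link.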
So right now I'm focusing more on trying to eliminate the error counts
while giving the system "time" to get to the point where it may hang.
No idea if the two issues are related. I'm definitely not giving up or
expecting someone else to do all the troubleshooting...I was just hoping
that maybe someone either knew about a fix or could give me some ideas
where to look. Now that I have some more ideas, I will continue trying
new things and look for an answer.
A question about logging--why do I not get any log entries or console
messages about these failures? Almost everyone else who had these
kinds of problems had log messages. Is there a way to enable verbose
logging in the network drivers? Or is there something I can do or
capture when the failure is occurring that would help me see what is
going on? Without that, I feel like I'm just guessing.
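In the meantime, one thing that could be captured is a timestamped record of when the error counters actually move, so a hang can be lined up against them afterwards. Below is a minimal sketch in Python, not a tested tool. It assumes `netstat -i` prints a header whose trailing columns (everything after "Address") are the counters, and that the first row for each interface is the link-level one. The function names (`parse_netstat`, `deltas`, `watch`) and the 10-second interval are hypothetical choices for illustration.

```python
import subprocess
import time


def parse_netstat(text):
    """Return {interface: {counter: value}} from `netstat -i`-style text.

    Assumption: the counter columns are the ones after "Address" in the
    header, and rows are read from the right so a missing Address field
    does not shift the counters.
    """
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    header = rows[0]
    names = header[header.index("Address") + 1:]
    table = {}
    for fields in rows[1:]:
        tail = fields[-len(names):]
        # Keep only the first row per interface; skip rows without counters.
        if fields[0] in table or not all(f.isdigit() for f in tail):
            continue
        table[fields[0]] = dict(zip(names, map(int, tail)))
    return table


def deltas(prev, curr):
    """Return {interface: {counter: change}} for counters that changed."""
    out = {}
    for iface, counters in curr.items():
        old = prev.get(iface, {})
        changed = {k: v - old.get(k, 0)
                   for k, v in counters.items() if v != old.get(k, 0)}
        if changed:
            out[iface] = changed
    return out


def snapshot():
    """Run `netstat -i` once and parse its output."""
    result = subprocess.run(["netstat", "-i"], capture_output=True, text=True)
    return parse_netstat(result.stdout)


def watch(interval=10):
    """Print a timestamped line whenever an error counter increases."""
    prev = snapshot()
    while True:
        time.sleep(interval)
        curr = snapshot()
        for iface, changed in deltas(prev, curr).items():
            errs = {k: v for k, v in changed.items() if "err" in k.lower()}
            if errs:
                print(time.strftime("%Y-%m-%d %H:%M:%S"), iface, errs)
        prev = curr
```

Running `watch()` in a terminal (or redirecting its output to a file) during the load tests would show whether the errors arrive in bursts or at a steady trickle, which is at least something concrete to report.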
Bryan