Hi Min,
I had problems with build 130, and saw that the e1000g driver was
updated in 131. That's why we updated.
I haven't tested with previous builds and my tests in 130 were not that
extensive, so I cannot tell whether it's the same problem or not.
The kstat command returns
Reset Count 0
Reset Count 0
on both nodes.
A colleague of mine connected the two machines back to back (I'm at
home, sick, not at work).
I restarted the transfers that failed last night and I'll keep you posted
about it (50GB already done).
I know this switch (HP 4208) doesn't support jumbo frames, so I haven't
enabled jumbo on the nodes.
If this transfer works, could this mean our switch is kind of broken or
buggy?
Regarding the replacement with other builds, I can do it on the
receiving end, but not that easily on the sending end (the one that
"loses" the connection).
Thanks,
Arnaud
On 04/02/2010 07:07, Min Miles Xu wrote:
Hi Arnaud,
On which build did you start to notice the issue? Could you still ping
through when you noticed the error? What's the output of "kstat -m
e1000g | grep Reset"?
Recv_Length_Errors indicates that the packets received are undersized
(< 64 bytes) or oversized (> 1522 bytes when jumbo frames aren't
enabled). I expect to narrow down the issue by simplifying the network
configuration. You mentioned the two machines are connected via a
switch. Could you try to have the two machines connected back-to-back?
Furthermore, could you try to replace one machine with other builds/OSs?
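The length limits mentioned above can be sketched as a small check. This is only an illustration, assuming the 64 and 1522 byte limits quoted in this message; the function and constant names are hypothetical and this is not how the e1000g driver itself is written.

```python
# Limits as quoted in the message above (jumbo frames not enabled).
MIN_FRAME_BYTES = 64
MAX_FRAME_BYTES = 1522


def classify_frame_length(length):
    """Return 'undersized', 'oversized' or 'ok' for a frame length in bytes.

    Hypothetical helper: a frame counted as a length error would fall in
    one of the first two categories.
    """
    if length < MIN_FRAME_BYTES:
        return "undersized"
    if length > MAX_FRAME_BYTES:
        return "oversized"
    return "ok"
```

With jumbo frames disabled on both nodes, an oversized result would hint at something on the path producing frames larger than expected.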
Thanks,
Miles Xu
Arnaud Brand wrote:
Hi folks,
My situation is the following: 2 computers (A and B) running
OpenSolaris b131 with Intel 82574L NICs, connected through an
HP 4208 switch.
Both computers are on the same network.
I have transfers running from computer A to computer B, either
through ssh or netcat.
As long as computer B is not too busy, the transfer goes like a charm.
But when B is really busy (doing zfs recv from a local file in this
case), the transfer fails in an odd way after some time (tests show
somewhere between 10 minutes and 13 hours).
What's odd is that A reports that it could not read from B and closes
the connection (no sign of it in netstat), but B still thinks the
connection is open.
Further, running "kstat -p | grep e1000g | grep -i err" on A shows
all zeroes except for the following:
e1000g:1:statistics:Recv_Length_Errors 14
link:0:e1000g1:ierrors 14
e1000g:1:mac:ierrors 14
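For anyone wanting to watch these counters over time, here is a minimal sketch that picks the non-zero error counters out of saved "kstat -p" text. It assumes each line is a counter name followed by whitespace and an integer value, as in the lines above; the function name is hypothetical.

```python
def nonzero_error_counters(kstat_text):
    """Return {counter_name: value} for error counters that are not zero.

    Assumes 'name value' lines as printed by kstat -p; lines that do not
    match that shape, or whose name does not contain 'err', are skipped.
    """
    counters = {}
    for line in kstat_text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue
        name, value = parts
        if "err" not in name.lower():
            continue
        try:
            count = int(value)
        except ValueError:
            continue
        if count != 0:
            counters[name] = count
    return counters
```

Comparing two such snapshots taken before and after a failed transfer would show whether the counters move at the moment the connection drops.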
More details on the test cases are available here:
http://opensolaris.org/jive/thread.jspa?threadID=122977&tstart=0
You can see that Brent Jones mentioned the following CR, but it is
marked as a duplicate of something fixed in 131.
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6905510
I did not do any twiddling in e1000g.conf.
Both e1000g interfaces are grouped in an aggregation named trk0.
Per advice of Richard Elling, I disabled LACP and, just to be sure, I
unplugged one network cable on each machine.
If any of you has any clue or workaround to try, please share.
Thanks,
Arnaud
_______________________________________________
networking-discuss mailing list
[email protected]