Re: Collection of strange lockups on 0.51

2012-10-03 Thread Andrey Korolyov
On Mon, Oct 1, 2012 at 8:42 PM, Tommi Virtanen t...@inktank.com wrote:
 On Sun, Sep 30, 2012 at 2:55 PM, Andrey Korolyov and...@xdel.ru wrote:
 Short post mortem - EX3200/12.1R2.9 may begin to drop packets (this
 seems more likely to happen under 0.51 traffic patterns, which is very
 strange for L2 switching) when a bunch of 802.3ad pairs, sixteen in my
 case, are exposed to extremely high load - a database benchmark over
 700+ rbd-backed VMs and a cluster rebalance at the same time. It
 explains the post-reboot lockups in the igb driver and all the types
 of lockups above. I would very much appreciate, both off-list and in
 this thread, any suggestions of switch models which do not show such
 behavior under the same kind of simultaneous load.

 I don't see how a switch dropping packets would give an ethernet card
 driver any excuse to crash, but I'm simultaneously happy to hear that
 it doesn't seem like Ceph is at fault, and sorry for your troubles.

 I don't have an up-to-date 1GbE card recommendation to share, but I
 would recommend making sure you're using a recent Linux kernel.

I formulated the reason incorrectly - of course drops cannot cause a
lockup by themselves, but the switch may somehow create a long-lasting
`corrupt` state on the trunk ports which leads to such lockups at the
ethernet card. Of course I'll play with the driver versions and
card|port settings, thanks for the suggestion :)
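By card|port settings I mean roughly the checks below; eth0 and bond0
are only placeholders here for the actual slave and bond device names:

  # driver, firmware and bus info of the suspect igb port
  ethtool -i eth0
  # current offload settings - toggling GRO/GSO/TSO is one knob to try
  ethtool -k eth0
  # ring buffer sizes, can be raised if rx drops show up
  ethtool -g eth0
  # 802.3ad state of the bond and its members
  cat /proc/net/bonding/bond0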

I'm still investigating the issue, since it is quite hard to reproduce
at the right time, and I hope to be able to capture this state with
tcpdump-like, i.e. software, methods - although if the card driver
locks up on something, that may prevent the problematic byte sequence
from being processed at the packet sniffer level.
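The capture itself would probably be a rotating ring buffer on the
bond, so the minutes right before a lockup are kept around (interface
name and sizes below are just an example):

  # keep roughly the last 10 x 100 MB of raw traffic on the bonded interface
  tcpdump -i bond0 -s 0 -w /var/tmp/bond0-ring.pcap -C 100 -W 10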


Re: Collection of strange lockups on 0.51

2012-10-01 Thread Tommi Virtanen
On Sun, Sep 30, 2012 at 2:55 PM, Andrey Korolyov and...@xdel.ru wrote:
 Short post mortem - EX3200/12.1R2.9 may begin to drop packets (this
 seems more likely to happen under 0.51 traffic patterns, which is very
 strange for L2 switching) when a bunch of 802.3ad pairs, sixteen in my
 case, are exposed to extremely high load - a database benchmark over
 700+ rbd-backed VMs and a cluster rebalance at the same time. It
 explains the post-reboot lockups in the igb driver and all the types
 of lockups above. I would very much appreciate, both off-list and in
 this thread, any suggestions of switch models which do not show such
 behavior under the same kind of simultaneous load.

I don't see how a switch dropping packets would give an ethernet card
driver any excuse to crash, but I'm simultaneously happy to hear that
it doesn't seem like Ceph is at fault, and sorry for your troubles.

I don't have an up-to-date 1GbE card recommendation to share, but I
would recommend making sure you're using a recent Linux kernel.
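A quick way to see what you are actually running is something like the
following (eth0 here stands for whichever igb port is acting up, so
this is only a sketch):

  uname -r                        # running kernel
  ethtool -i eth0                 # igb driver version and NIC firmware
  modinfo igb | grep -i version   # version of the igb module on disk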


Re: Collection of strange lockups on 0.51

2012-09-30 Thread Andrey Korolyov
On Thu, Sep 13, 2012 at 1:43 AM, Andrey Korolyov and...@xdel.ru wrote:
 On Thu, Sep 13, 2012 at 1:09 AM, Tommi Virtanen t...@inktank.com wrote:
 On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov and...@xdel.ru wrote:
 Hi,
 This is completely off-list, but I'm asking because only ceph triggers
 such a bug :).

 With 0.51, the following happens: if I kill an osd, one or more
 neighbor nodes may go into a hung state with cpu lockups, not related
 to temperature, overall interrupt count or load average, and it
 happens randomly across the 16-node cluster. I am almost sure that
 ceph is triggering some hardware bug, but I am not quite sure of its
 origin. Also, for a short time after a reset from such a crash, a new
 lockup may be triggered by any action.

 From the log, it looks like your ethernet driver is crapping out.

 [172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out
 ...
 [172517.058622]  [812b2975] ? netif_tx_lock+0x40/0x76

 etc.

 The later oopses are talking about paravirt_write_msr etc, which makes
 me think you're using Xen? You probably don't want to run Ceph servers
 inside virtualization (for production).

 Nope. Xen was my choice for almost five years, but by now I have
 replaced it with kvm everywhere due to the buggy 4.1 '-stable'. 4.0
 has the same poor network performance as 3.x but can really be called
 stable. All those backtraces come from bare hardware.

 At the end you can see a nice backtrace which comes out soon after the
 end of the boot sequence, when I manually typed 'modprobe rbd'; from
 experience it may be any other command. Since I don't know of any
 long-lasting states in intel cards, especially ones which would
 survive the ipmi reset button, I think the first-sight complaint about
 igb may not be quite right. If these cards can save some of their
 runtime state to EEPROM and pull it back later, then I'm wrong.

Short post mortem - EX3200/12.1R2.9 may begin to drop packets (this
seems more likely to happen under 0.51 traffic patterns, which is very
strange for L2 switching) when a bunch of 802.3ad pairs, sixteen in my
case, are exposed to extremely high load - a database benchmark over
700+ rbd-backed VMs and a cluster rebalance at the same time. It
explains the post-reboot lockups in the igb driver and all the types
of lockups above. I would very much appreciate, both off-list and in
this thread, any suggestions of switch models which do not show such
behavior under the same kind of simultaneous load.
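What I can still do from the host side while the benchmark and the
rebalance are running is to watch the per-slave error counters,
roughly like this (eth0 stands for each bond member in turn, so this
is only a sketch):

  # missed/dropped counters growing under load would point at the host side;
  # a clean host side with VMs still stalling points back at the switch
  watch -n 1 'ethtool -S eth0 | grep -iE "miss|drop|err"'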



 [172696.503900]  [8100d025] ? paravirt_write_msr+0xb/0xe
 [172696.503942]  [810325f3] ? leave_mm+0x3e/0x3e

 and *then* you get

 [172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
 [172695.041745] megasas: [ 0]waiting for 35 commands to complete
 [172696.045602] megaraid_sas: no pending cmds after reset
 [172696.045644] megasas: reset successful

 which just adds more awesomeness to the soup -- though I do wonder if
 this could be caused by the soft hang from earlier.


Collection of strange lockups on 0.51

2012-09-12 Thread Andrey Korolyov
Hi,

This is completely off-list, but I'm asking because only ceph triggers
such a bug :).

With 0.51, the following happens: if I kill an osd, one or more
neighbor nodes may go into a hung state with cpu lockups, not related
to temperature, overall interrupt count or load average, and it
happens randomly across the 16-node cluster. I am almost sure that
ceph is triggering some hardware bug, but I am not quite sure of its
origin. Also, for a short time after a reset from such a crash, a new
lockup may be triggered by any action.

Before blaming system drivers and continuing to investigate the
problem, may I ask if someone has faced a similar problem? I am using
802.3ad on a pair of intel 350 cards for general connectivity. I have
attached a bit of the traces which were pushed to netconsole (in some
cases the machine died hard, i.e. without even sending a final bye
over netconsole, so the log is not complete).
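For reference, netconsole on these nodes is loaded roughly like the
line below (addresses, port numbers and MAC are made-up placeholders,
not the real ones):

  # format: src-port@src-ip/src-dev,dst-port@dst-ip/dst-mac
  modprobe netconsole netconsole=6665@192.168.0.10/eth0,6666@192.168.0.1/00:11:22:33:44:55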


netcon.log.gz
Description: GNU Zip compressed data


Re: Collection of strange lockups on 0.51

2012-09-12 Thread Tommi Virtanen
On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov and...@xdel.ru wrote:
 Hi,
 This is completely off-list, but I'm asking because only ceph triggers
 such a bug :).

 With 0.51, the following happens: if I kill an osd, one or more
 neighbor nodes may go into a hung state with cpu lockups, not related
 to temperature, overall interrupt count or load average, and it
 happens randomly across the 16-node cluster. I am almost sure that
 ceph is triggering some hardware bug, but I am not quite sure of its
 origin. Also, for a short time after a reset from such a crash, a new
 lockup may be triggered by any action.

From the log, it looks like your ethernet driver is crapping out.

[172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out
...
[172517.058622]  [812b2975] ? netif_tx_lock+0x40/0x76

etc.

The later oopses are talking about paravirt_write_msr etc, which makes
me think you're using Xen? You probably don't want to run Ceph servers
inside virtualization (for production).

[172696.503900]  [8100d025] ? paravirt_write_msr+0xb/0xe
[172696.503942]  [810325f3] ? leave_mm+0x3e/0x3e

and *then* you get

[172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
[172695.041745] megasas: [ 0]waiting for 35 commands to complete
[172696.045602] megaraid_sas: no pending cmds after reset
[172696.045644] megasas: reset successful

which just adds more awesomeness to the soup -- though I do wonder if
this could be caused by the soft hang from earlier.


Re: Collection of strange lockups on 0.51

2012-09-12 Thread Andrey Korolyov
On Thu, Sep 13, 2012 at 1:09 AM, Tommi Virtanen t...@inktank.com wrote:
 On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov and...@xdel.ru wrote:
 Hi,
 This is completely off-list, but I'm asking because only ceph triggers
 such a bug :).

 With 0.51, the following happens: if I kill an osd, one or more
 neighbor nodes may go into a hung state with cpu lockups, not related
 to temperature, overall interrupt count or load average, and it
 happens randomly across the 16-node cluster. I am almost sure that
 ceph is triggering some hardware bug, but I am not quite sure of its
 origin. Also, for a short time after a reset from such a crash, a new
 lockup may be triggered by any action.

 From the log, it looks like your ethernet driver is crapping out.

 [172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out
 ...
 [172517.058622]  [812b2975] ? netif_tx_lock+0x40/0x76

 etc.

 The later oopses are talking about paravirt_write_msr etc, which makes
 me think you're using Xen? You probably don't want to run Ceph servers
 inside virtualization (for production).

Nope. Xen was my choice for almost five years, but by now I have
replaced it with kvm everywhere due to the buggy 4.1 '-stable'. 4.0
has the same poor network performance as 3.x but can really be called
stable. All those backtraces come from bare hardware.

At the end you can see a nice backtrace which comes out soon after the
end of the boot sequence, when I manually typed 'modprobe rbd'; from
experience it may be any other command. Since I don't know of any
long-lasting states in intel cards, especially ones which would
survive the ipmi reset button, I think the first-sight complaint about
igb may not be quite right. If these cards can save some of their
runtime state to EEPROM and pull it back later, then I'm wrong.


 [172696.503900]  [8100d025] ? paravirt_write_msr+0xb/0xe
 [172696.503942]  [810325f3] ? leave_mm+0x3e/0x3e

 and *then* you get

 [172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
 [172695.041745] megasas: [ 0]waiting for 35 commands to complete
 [172696.045602] megaraid_sas: no pending cmds after reset
 [172696.045644] megasas: reset successful

 which just adds more awesomeness to the soup -- though I do wonder if
 this could be caused by the soft hang from earlier.