Re: Collection of strange lockups on 0.51
On Mon, Oct 1, 2012 at 8:42 PM, Tommi Virtanen <t...@inktank.com> wrote:
> On Sun, Sep 30, 2012 at 2:55 PM, Andrey Korolyov <and...@xdel.ru> wrote:
>> Short post mortem - the EX3200/12.1R2.9 may begin to drop packets (this seems more likely under 0.51 traffic patterns, which is very strange for L2 switching) when a bunch of 802.3ad pairs, sixteen in my case, are exposed to extremely high load - a database benchmark over 700+ rbd-backed VMs and a cluster rebalance at the same time. It explains the post-reboot lockups in the igb driver and all the types of lockups above. I would very much appreciate suggestions, both off-list and in this thread, of switch models which do not show such behavior under these conditions.
>
> I don't see how a switch dropping packets would give an ethernet card driver any excuse to crash, but I'm simultaneously happy to hear that it doesn't seem like Ceph is at fault, and sorry for your troubles. I don't have an up-to-date 1GbE card recommendation to share, but I would recommend making sure you're using a recent Linux kernel.

I formulated the reason incorrectly - of course the drops cannot cause a lockup by themselves, but the switch may somehow create a long-lasting "corrupt" state on the trunk ports which leads to such lockups on the ethernet card. Of course I'll play with driver versions and card/port settings, thanks for the suggestion :)

I'm still investigating the issue, since it is quite hard to reproduce at the right moment, and I hope to capture this state using tcpdump-like, i.e. software, methods - if the card driver locks up on something, it may be possible to spot the problematic byte sequence at the packet sniffer level.
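For reference, a rotating ring-buffer capture along the lines described above might look like the sketch below; the interface name and buffer sizes are assumptions, not part of the original report.

    # Keep the last ~2 GB of traffic in 20 x 100 MB files, truncating each
    # packet to 256 bytes, so the frames seen just before a lockup survive
    # without filling the disk. bond0 is an assumed interface name.
    tcpdump -i bond0 -s 256 -w /var/tmp/lockup.pcap -C 100 -W 20

Rotating with -C/-W lets the capture run unattended for days; after a node locks up, the newest file holds the traffic immediately preceding the hang.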
Re: Collection of strange lockups on 0.51
On Sun, Sep 30, 2012 at 2:55 PM, Andrey Korolyov <and...@xdel.ru> wrote:
> Short post mortem - the EX3200/12.1R2.9 may begin to drop packets (this seems more likely under 0.51 traffic patterns, which is very strange for L2 switching) when a bunch of 802.3ad pairs, sixteen in my case, are exposed to extremely high load - a database benchmark over 700+ rbd-backed VMs and a cluster rebalance at the same time. It explains the post-reboot lockups in the igb driver and all the types of lockups above. I would very much appreciate suggestions, both off-list and in this thread, of switch models which do not show such behavior under these conditions.

I don't see how a switch dropping packets would give an ethernet card driver any excuse to crash, but I'm simultaneously happy to hear that it doesn't seem like Ceph is at fault, and sorry for your troubles. I don't have an up-to-date 1GbE card recommendation to share, but I would recommend making sure you're using a recent Linux kernel.
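As a side note to the kernel recommendation, the running kernel and igb driver versions can be recorded with standard tools before upgrading; eth0 is an assumed interface name.

    uname -r                 # running kernel version
    ethtool -i eth0          # loaded igb driver and NIC firmware versions
    modinfo igb | head -n 5  # version of the igb module available on disk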
Re: Collection of strange lockups on 0.51
On Thu, Sep 13, 2012 at 1:43 AM, Andrey Korolyov <and...@xdel.ru> wrote:
> On Thu, Sep 13, 2012 at 1:09 AM, Tommi Virtanen <t...@inktank.com> wrote:
>> On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov <and...@xdel.ru> wrote:
>>> Hi, this is completely off-list, but I'm asking because only Ceph triggers such a bug :). With 0.51 the following happens: if I kill an OSD, one or more neighbor nodes may hang with CPU lockups; this is not related to temperature, overall interrupt count, or load average, and it happens randomly across the 16-node cluster. I'm almost sure that Ceph is triggering some hardware bug, but I'm not quite sure of its origin. Also, a short time after resetting from such a crash, a new lockup can be triggered by any action.
>>
>> From the log, it looks like your ethernet driver is crapping out.
>>
>> [172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out
>> ...
>> [172517.058622] [812b2975] ? netif_tx_lock+0x40/0x76
>>
>> etc. The later oopses are talking about paravirt_write_msr etc, which makes me think you're using Xen? You probably don't want to run Ceph servers inside virtualization (for production).
>
> NOPE. Xen was my choice for almost five years, but right now I have replaced it with KVM everywhere due to the buggy 4.1 '-stable'; 4.0 has the same poor network performance as 3.x but can really be called stable. All those backtraces come from bare hardware. At the end you can see a nice backtrace which shows up soon after the end of the boot sequence, when I manually typed 'modprobe rbd' - judging from experience, it could have been any other command. Since I don't know anything about long-lasting states in Intel cards, especially ones that would survive the IPMI reset button, I think the first-sight complaint about igb may not be quite right. If these cards can save some of their runtime state to EEPROM and pull it back, then I'm wrong.

Short post mortem - the EX3200/12.1R2.9 may begin to drop packets (this seems more likely under 0.51 traffic patterns, which is very strange for L2 switching) when a bunch of 802.3ad pairs, sixteen in my case, are exposed to extremely high load - a database benchmark over 700+ rbd-backed VMs and a cluster rebalance at the same time. It explains the post-reboot lockups in the igb driver and all the types of lockups above. I would very much appreciate suggestions, both off-list and in this thread, of switch models which do not show such behavior under these conditions.

>> [172696.503900] [8100d025] ? paravirt_write_msr+0xb/0xe
>> [172696.503942] [810325f3] ? leave_mm+0x3e/0x3e
>>
>> and *then* you get
>>
>> [172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
>> [172695.041745] megasas: [ 0]waiting for 35 commands to complete
>> [172696.045602] megaraid_sas: no pending cmds after reset
>> [172696.045644] megasas: reset successful
>>
>> which just adds more awesomeness to the soup -- though I do wonder if this could be caused by the soft hang from earlier.
Collection of strange lockups on 0.51
Hi, this is completely off-list, but I'm asking because only Ceph triggers such a bug :). With 0.51 the following happens: if I kill an OSD, one or more neighbor nodes may hang with CPU lockups; this is not related to temperature, overall interrupt count, or load average, and it happens randomly across the 16-node cluster. I'm almost sure that Ceph is triggering some hardware bug, but I'm not quite sure of its origin. Also, a short time after resetting from such a crash, a new lockup can be triggered by any action.

Before blaming system drivers and continuing to investigate the problem, may I ask if someone has faced a similar problem? I am using 802.3ad on a pair of Intel I350 adapters for general connectivity. I have attached a portion of the traces that were pushed to netconsole (in some cases the machine died hard, not even sending a final bye over netconsole, so the log is not complete).

[Attachment: netcon.log.gz]
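For context, a netconsole setup of the kind mentioned here is loaded roughly as below; the addresses, MAC, and interface names are placeholders, and the bond check is a generic 802.3ad sanity check rather than part of the original post.

    # Stream kernel messages over UDP to a log host so traces survive a hard hang.
    # Parameter format: netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@[tgt-ip]/[tgt-mac]
    modprobe netconsole netconsole=@/eth0,6666@192.168.0.10/00:11:22:33:44:55
    # Inspect the 802.3ad bond: LACP partner state and link failure counts per slave.
    cat /proc/net/bonding/bond0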
Re: Collection of strange lockups on 0.51
On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov <and...@xdel.ru> wrote:
> Hi, this is completely off-list, but I'm asking because only Ceph triggers such a bug :). With 0.51 the following happens: if I kill an OSD, one or more neighbor nodes may hang with CPU lockups; this is not related to temperature, overall interrupt count, or load average, and it happens randomly across the 16-node cluster. I'm almost sure that Ceph is triggering some hardware bug, but I'm not quite sure of its origin. Also, a short time after resetting from such a crash, a new lockup can be triggered by any action.

From the log, it looks like your ethernet driver is crapping out.

[172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out
...
[172517.058622] [812b2975] ? netif_tx_lock+0x40/0x76

etc. The later oopses are talking about paravirt_write_msr etc, which makes me think you're using Xen? You probably don't want to run Ceph servers inside virtualization (for production).

[172696.503900] [8100d025] ? paravirt_write_msr+0xb/0xe
[172696.503942] [810325f3] ? leave_mm+0x3e/0x3e

and *then* you get

[172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
[172695.041745] megasas: [ 0]waiting for 35 commands to complete
[172696.045602] megaraid_sas: no pending cmds after reset
[172696.045644] megasas: reset successful

which just adds more awesomeness to the soup -- though I do wonder if this could be caused by the soft hang from earlier.
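When chasing NETDEV WATCHDOG transmit-queue timeouts like the one above, a common first step (shown here purely as an illustrative sketch, with eth0 an assumed name) is to rule out offload bugs and undersized rings on the NIC.

    ethtool -k eth0                          # list current offload settings
    ethtool -K eth0 tso off gso off gro off  # retest with segmentation offloads disabled
    ethtool -g eth0                          # show current and maximum ring sizes
    ethtool -G eth0 tx 4096                  # enlarge the TX ring if the NIC allows it

If the timeouts disappear with offloads off, that points at a driver or firmware offload path rather than the switch.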
Re: Collection of strange lockups on 0.51
On Thu, Sep 13, 2012 at 1:09 AM, Tommi Virtanen <t...@inktank.com> wrote:
> On Wed, Sep 12, 2012 at 10:33 AM, Andrey Korolyov <and...@xdel.ru> wrote:
>> Hi, this is completely off-list, but I'm asking because only Ceph triggers such a bug :). With 0.51 the following happens: if I kill an OSD, one or more neighbor nodes may hang with CPU lockups; this is not related to temperature, overall interrupt count, or load average, and it happens randomly across the 16-node cluster. I'm almost sure that Ceph is triggering some hardware bug, but I'm not quite sure of its origin. Also, a short time after resetting from such a crash, a new lockup can be triggered by any action.
>
> From the log, it looks like your ethernet driver is crapping out.
>
> [172517.057886] NETDEV WATCHDOG: eth0 (igb): transmit queue 7 timed out
> ...
> [172517.058622] [812b2975] ? netif_tx_lock+0x40/0x76
>
> etc. The later oopses are talking about paravirt_write_msr etc, which makes me think you're using Xen? You probably don't want to run Ceph servers inside virtualization (for production).

NOPE. Xen was my choice for almost five years, but right now I have replaced it with KVM everywhere due to the buggy 4.1 '-stable'; 4.0 has the same poor network performance as 3.x but can really be called stable. All those backtraces come from bare hardware. At the end you can see a nice backtrace which shows up soon after the end of the boot sequence, when I manually typed 'modprobe rbd' - judging from experience, it could have been any other command. Since I don't know anything about long-lasting states in Intel cards, especially ones that would survive the IPMI reset button, I think the first-sight complaint about igb may not be quite right. If these cards can save some of their runtime state to EEPROM and pull it back, then I'm wrong.

> [172696.503900] [8100d025] ? paravirt_write_msr+0xb/0xe
> [172696.503942] [810325f3] ? leave_mm+0x3e/0x3e
>
> and *then* you get
>
> [172695.041709] sd 0:2:0:0: [sda] megasas: RESET cmd=2a retries=0
> [172695.041745] megasas: [ 0]waiting for 35 commands to complete
> [172696.045602] megaraid_sas: no pending cmds after reset
> [172696.045644] megasas: reset successful
>
> which just adds more awesomeness to the soup -- though I do wonder if this could be caused by the soft hang from earlier.
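Whether the card holds persistent state can be probed to some extent with ethtool; this is a generic sketch, not something from the original thread, and eth0 is an assumed name.

    ethtool -i eth0          # driver version and NIC firmware version
    ethtool -e eth0 | head   # hex dump of the start of the NVM/EEPROM contents

Comparing the EEPROM dump before and after a lockup would show whether any nonvolatile state actually changes across an IPMI reset.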