Hello everyone!

It looks like I've got a problem: my SRP connection suddenly failed. Here is the dmesg output from my node:

===
[Tue Mar 25 11:36:43 2014] INFO: task blkback.1.xvda:20168 blocked for more than 120 seconds.
[Tue Mar 25 11:36:43 2014] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Tue Mar 25 11:36:43 2014] blkback.1.xvda D ffff8801b9a53f40 0 20168 2 0x00000000
[Tue Mar 25 11:36:43 2014] ffff88012dd84040 0000000000000246 00000000001c001c ffff880176244040
[Tue Mar 25 11:36:43 2014] 0000000000013f40 ffff88016aae7fd8 ffff88016aae7fd8 ffff88012dd84040
[Tue Mar 25 11:36:43 2014] ffffffff813907d4 ffff88015c2bb000 ffff88016aae79b0 ffff88015c2bb290
[Tue Mar 25 11:36:43 2014] Call Trace:
[Tue Mar 25 11:36:43 2014] [<ffffffff813907d4>] ? _raw_spin_unlock_irqrestore+0xc/0xd
[Tue Mar 25 11:36:43 2014] [<ffffffffa0297ef5>] ? md_write_start+0x131/0x147 [md_mod]
[Tue Mar 25 11:36:43 2014] [<ffffffff811bd8f7>] ? arch_local_irq_restore+0x7/0x8
[Tue Mar 25 11:36:43 2014] [<ffffffff8105858f>] ? abort_exclusive_wait+0x79/0x79
[Tue Mar 25 11:36:43 2014] [<ffffffffa047171d>] ? make_request+0x37/0xa63 [raid1]
[Tue Mar 25 11:36:43 2014] [<ffffffff811ad689>] ? blk_peek_request+0x1c2/0x209
[Tue Mar 25 11:36:43 2014] [<ffffffff8107f67f>] ? arch_local_irq_disable+0x7/0x8
[Tue Mar 25 11:36:43 2014] [<ffffffff81390691>] ? _raw_read_lock_irqsave+0x21/0x2a
[Tue Mar 25 11:36:43 2014] [<ffffffff8107f677>] ? arch_local_irq_restore+0x7/0x8
[Tue Mar 25 11:36:43 2014] [<ffffffff8139083f>] ? _raw_read_unlock_irqrestore+0xb/0xc
[Tue Mar 25 11:36:43 2014] [<ffffffffa02d320a>] ? dm_get_live_table+0x35/0x3d [dm_mod]
[Tue Mar 25 11:36:43 2014] [<ffffffffa029caf7>] ? md_make_request+0xee/0x1df [md_mod]
[Tue Mar 25 11:36:43 2014] [<ffffffff811ac10f>] ? generic_make_request+0x96/0xd5
[Tue Mar 25 11:36:43 2014] [<ffffffff811acd1a>] ? submit_bio+0x10a/0x13b
[Tue Mar 25 11:36:43 2014] [<ffffffffa0460dfb>] ? dispatch_rw_block_io+0x32d/0x3d7 [xen_blkback]
[Tue Mar 25 11:36:43 2014] [<ffffffff81067c91>] ? load_balance+0xb1/0x5de
[Tue Mar 25 11:36:43 2014] [<ffffffff8100413f>] ? arch_local_irq_restore+0x7/0x8
[Tue Mar 25 11:36:43 2014] [<ffffffff810042e3>] ? xen_mc_flush+0x11e/0x161
[Tue Mar 25 11:36:43 2014] [<ffffffff81003173>] ? xen_end_context_switch+0xe/0x1c
[Tue Mar 25 11:36:43 2014] [<ffffffff8100383f>] ? xen_mc_issue.constprop.22+0x27/0x4d
[Tue Mar 25 11:36:43 2014] [<ffffffffa04610fd>] ? __do_block_io_op+0x258/0x390 [xen_blkback]
[Tue Mar 25 11:36:43 2014] [<ffffffff8105f63c>] ? mmdrop+0xd/0x1c
[Tue Mar 25 11:36:43 2014] [<ffffffff810602aa>] ? finish_task_switch+0x83/0xae
[Tue Mar 25 11:36:43 2014] [<ffffffffa0461568>] ? xen_blkif_schedule+0x30b/0x417 [xen_blkback]
[Tue Mar 25 11:36:43 2014] [<ffffffff8105858f>] ? abort_exclusive_wait+0x79/0x79
[Tue Mar 25 11:36:43 2014] [<ffffffffa046125d>] ? xen_blkif_be_int+0x28/0x28 [xen_blkback]
[Tue Mar 25 11:36:43 2014] [<ffffffffa046125d>] ? xen_blkif_be_int+0x28/0x28 [xen_blkback]
[Tue Mar 25 11:36:43 2014] [<ffffffff81057bf5>] ? kthread+0x81/0x89
[Tue Mar 25 11:36:43 2014] [<ffffffff8100383f>] ? xen_mc_issue.constprop.22+0x27/0x4d
[Tue Mar 25 11:36:43 2014] [<ffffffff81057b74>] ? __kthread_parkme+0x5d/0x5d
[Tue Mar 25 11:36:43 2014] [<ffffffff81395bbc>] ? ret_from_fork+0x7c/0xb0
[Tue Mar 25 11:36:43 2014] [<ffffffff81057b74>] ? __kthread_parkme+0x5d/0x5d
[Tue Mar 25 11:37:17 2014] scsi host7: ib_srp: DREQ received - connection closed
[Tue Mar 25 11:37:19 2014] scsi host7: ib_srp: connection closed
===

It looks like the node simply closed the connection, which I assume is normal behaviour. So I checked the target node (storage) for errors:

===
[Tue Mar 25 11:35:26 2014] ib_srpt: RDMA t 5 for idx 16 failed with status 12
[Tue Mar 25 11:35:26 2014] ib_srpt: sending response for idx 16 failed with status 5
[Tue Mar 25 11:35:26 2014] ib_srpt: RDMA t 5 for idx 6 failed with status 5
[Tue Mar 25 11:35:26 2014] ib_srpt: sending response for idx 6 failed with status 5
[Tue Mar 25 11:35:26 2014] ib_srpt: RDMA t 5 for idx 21 failed with status 5
[Tue Mar 25 11:35:26 2014] ib_srpt: sending response for idx 21 failed with status 5
[Tue Mar 25 11:35:26 2014] ib_srpt: RDMA t 5 for idx 14 failed with status 5
[Tue Mar 25 11:35:26 2014] ib_srpt: sending response for idx 14 failed with status 5
[Tue Mar 25 11:35:26 2014] ib_srpt: RDMA t 5 for idx 29 failed with status 5
[Tue Mar 25 11:35:26 2014] ib_srpt: sending response for idx 29 failed with status 5
[Tue Mar 25 11:35:26 2014] ib_srpt: RDMA t 4 for idx 28 failed with status 5
[Tue Mar 25 11:35:26 2014] ib_srpt: sending response for idx 24 failed with status 5
[Tue Mar 25 11:35:26 2014] ib_srpt: sending response for idx 28 failed with status 5
[Tue Mar 25 11:36:22 2014] ib_srpt: Received SRP_LOGIN_REQ with i_port_id 31ad:1f00:03c9:0200:0025:90ff:ff90:4081, t_port_id 0002:c903:001f:ad30:0002:c903:001f:ad30 and it_iu_len 260 on port 1 (guid=fe80:0000:0000:0000:0002:c903:001f:ad31)
[Tue Mar 25 11:36:22 2014] scst: Using security group "ib_srpt_target_0" for initiator "0x31ad1f0003c90200002590ffff904081" (target ib_srpt_target_0)
[Tue Mar 25 11:37:01 2014] ib_srpt: RDMA t 5 for idx 0 failed with status 12
[Tue Mar 25 11:37:01 2014] ib_srpt: sending response for idx 3 failed with status 5
[Tue Mar 25 11:37:01 2014] ib_srpt: sending response for idx 0 failed with status 5
[Tue Mar 25 11:37:01 2014] ib_srpt: RDMA t 5 for idx 2 failed with status 5
[Tue Mar 25 11:37:01 2014] ib_srpt: RDMA t 4 for idx 5 failed with status 5
[Tue Mar 25 11:37:01 2014] ib_srpt: sending response for idx 2 failed with status 5
[Tue Mar 25 11:37:01 2014] ib_srpt: RDMA t 5 for idx 4 failed with status 5
[Tue Mar 25 11:37:01 2014] ib_srpt: sending response for idx 4 failed with status 5
[Tue Mar 25 11:37:01 2014] ib_srpt: RDMA t 5 for idx 1 failed with status 5
[Tue Mar 25 11:37:01 2014] ib_srpt: sending response for idx 1 failed with status 5
[Tue Mar 25 11:37:01 2014] ib_srpt: sending response for idx 6 failed with status 5
[Tue Mar 25 11:37:01 2014] ib_srpt: sending response for idx 5 failed with status 5
===

I don't understand what status 5 means. I also searched for this error and found some information on mailing lists:

===
mlx4_0/ports/1/counters/excessive_buffer_overrun_errors:0
mlx4_0/ports/1/counters/link_error_recovery:0
mlx4_0/ports/1/counters/local_link_integrity_errors:0
mlx4_0/ports/1/counters/port_rcv_constraint_errors:0
mlx4_0/ports/1/counters/port_rcv_errors:117
mlx4_0/ports/1/counters/port_rcv_remote_physical_errors:0
mlx4_0/ports/1/counters/port_rcv_switch_relay_errors:0
mlx4_0/ports/1/counters/port_xmit_constraint_errors:0
mlx4_0/ports/1/counters/symbol_error:177
===

Thanks

--
Best regards,
Egor
http://aylium.net
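
P.S. To make the non-zero entries in the counter listing easier to spot, here is a minimal Python sketch that filters them out. It assumes each line has the form `path:value` exactly as pasted above; the function name `nonzero_counters` and the `LISTING` variable are made up for this illustration and do not come from any existing tool.

```python
def nonzero_counters(listing):
    """Return {counter_name: value} for every counter whose value is not zero.

    Assumes one "path:value" entry per line, as in the listing above.
    Lines that do not match that shape are skipped.
    """
    result = {}
    for line in listing.splitlines():
        line = line.strip()
        if ":" not in line:
            continue
        # Split on the last colon so the path part stays intact.
        path, _, value = line.rpartition(":")
        try:
            count = int(value)
        except ValueError:
            continue
        if count != 0:
            # Keep only the final path component as the counter name.
            result[path.rsplit("/", 1)[-1]] = count
    return result


# The counter listing from this message, pasted verbatim.
LISTING = """\
mlx4_0/ports/1/counters/excessive_buffer_overrun_errors:0
mlx4_0/ports/1/counters/link_error_recovery:0
mlx4_0/ports/1/counters/local_link_integrity_errors:0
mlx4_0/ports/1/counters/port_rcv_constraint_errors:0
mlx4_0/ports/1/counters/port_rcv_errors:117
mlx4_0/ports/1/counters/port_rcv_remote_physical_errors:0
mlx4_0/ports/1/counters/port_rcv_switch_relay_errors:0
mlx4_0/ports/1/counters/port_xmit_constraint_errors:0
mlx4_0/ports/1/counters/symbol_error:177
"""

if __name__ == "__main__":
    for name, count in nonzero_counters(LISTING).items():
        print(f"{name}: {count}")
```

On the listing above, only port_rcv_errors and symbol_error are non-zero.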
