On Fri, Apr 24, 2015 at 6:41 PM, Nikola Ciprich
<[email protected]> wrote:
> Hello once again,
>
> I seem to have hit one more problem today:
> 3-node test cluster, nodes running the 3.18.1 kernel,
> ceph-0.94.1, a 3-replica pool backed by SSD OSDs.
Does this mean the rbd device is mapped on a node that also runs one
or more OSDs?
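For reference, a quick way to check both halves of that question on the node in question (a sketch using stock tooling; nothing here is specific to this cluster):

```shell
# Show rbd images mapped on this host (guarded in case the rbd CLI
# isn't installed locally).
command -v rbd >/dev/null && rbd showmapped || echo "rbd tool not available"

# Check whether any ceph-osd daemons run on the same host. Mapping a
# kernel rbd device on an OSD node is generally discouraged because it
# risks writeback deadlocks under memory pressure, which is presumably
# the concern behind the question.
pgrep -l ceph-osd && echo "ceph-osd running locally" || echo "no local ceph-osd"
```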
>
> After mapping a volume using rbd and trying to zero it
> with dd:
>
> dd if=/dev/zero of=/dev/rbd0 bs=1M
>
> it ran fine for some time at ~200 MB/s,
> but the speed slowly dropped to ~70 MB/s, then the process
> hung and the following backtraces started appearing in dmesg:
>
> Apr 24 17:09:45 vfnphav1a kernel: [340710.888081] INFO: task
> kworker/u8:2:15884 blocked for more than 120 seconds.
> Apr 24 17:09:45 vfnphav1a kernel: [340710.895645] Not tainted
> 3.18.11lb6.01 #1
> Apr 24 17:09:45 vfnphav1a kernel: [340710.900612] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Apr 24 17:09:45 vfnphav1a kernel: [340710.909290] kworker/u8:2 D
> 0000000000000001 0 15884 2 0x00000000
> Apr 24 17:09:45 vfnphav1a kernel: [340710.917043] Workqueue: writeback
> bdi_writeback_workfn (flush-252:0)
> Apr 24 17:09:45 vfnphav1a kernel: [340710.923998] ffff880172b73608
> 0000000000000046 ffff88021424a850 0000000000004000
> Apr 24 17:09:45 vfnphav1a kernel: [340710.932595] ffff8801988d3120
> 0000000000011640 ffff880172b70010 0000000000011640
> Apr 24 17:09:45 vfnphav1a kernel: [340710.941193] 0000000000004000
> 0000000000011640 ffff8800d7689890 ffff8801988d3120
> Apr 24 17:09:45 vfnphav1a kernel: [340710.949799] Call Trace:
> Apr 24 17:09:45 vfnphav1a kernel: [340710.952746] [<ffffffff8149882e>] ?
> _raw_spin_unlock+0xe/0x30
> Apr 24 17:09:45 vfnphav1a kernel: [340710.959009] [<ffffffff8123ba6b>] ?
> queue_unplugged+0x5b/0xe0
> Apr 24 17:09:45 vfnphav1a kernel: [340710.965258] [<ffffffff81494149>]
> schedule+0x29/0x70
> Apr 24 17:09:45 vfnphav1a kernel: [340710.970728] [<ffffffff8149421c>]
> io_schedule+0x8c/0xd0
> Apr 24 17:09:45 vfnphav1a kernel: [340710.976462] [<ffffffff81239e95>]
> get_request+0x445/0x860
> Apr 24 17:09:45 vfnphav1a kernel: [340710.982366] [<ffffffff81086680>] ?
> bit_waitqueue+0x80/0x80
> Apr 24 17:09:45 vfnphav1a kernel: [340710.988443] [<ffffffff812358eb>] ?
> elv_merge+0xeb/0xf0
> Apr 24 17:09:45 vfnphav1a kernel: [340710.994167] [<ffffffff8123bdf8>]
> blk_queue_bio+0xc8/0x360
> Apr 24 17:09:45 vfnphav1a kernel: [340711.000159] [<ffffffff81239790>]
> generic_make_request+0xc0/0x100
> Apr 24 17:09:45 vfnphav1a kernel: [340711.006760] [<ffffffff81239841>]
> submit_bio+0x71/0x140
> Apr 24 17:09:45 vfnphav1a kernel: [340711.012489] [<ffffffff811b5aae>]
> _submit_bh+0x11e/0x170
> Apr 24 17:09:45 vfnphav1a kernel: [340711.018307] [<ffffffff811b5b10>]
> submit_bh+0x10/0x20
> Apr 24 17:09:45 vfnphav1a kernel: [340711.023865] [<ffffffff811b98e8>]
> __block_write_full_page.clone.0+0x198/0x340
> Apr 24 17:09:45 vfnphav1a kernel: [340711.031846] [<ffffffff811b9cb0>] ?
> I_BDEV+0x10/0x10
> Apr 24 17:09:45 vfnphav1a kernel: [340711.037313] [<ffffffff811b9cb0>] ?
> I_BDEV+0x10/0x10
> Apr 24 17:09:45 vfnphav1a kernel: [340711.042784] [<ffffffff811b9c5a>]
> block_write_full_page+0xba/0x100
> Apr 24 17:09:45 vfnphav1a kernel: [340711.049477] [<ffffffff811bab88>]
> blkdev_writepage+0x18/0x20
> Apr 24 17:09:45 vfnphav1a kernel: [340711.055642] [<ffffffff811231ca>]
> __writepage+0x1a/0x50
> Apr 24 17:09:45 vfnphav1a kernel: [340711.061374] [<ffffffff81124427>]
> write_cache_pages+0x1e7/0x4e0
> Apr 24 17:09:45 vfnphav1a kernel: [340711.067797] [<ffffffff811231b0>] ?
> set_page_dirty+0x60/0x60
> Apr 24 17:09:45 vfnphav1a kernel: [340711.073952] [<ffffffff81124774>]
> generic_writepages+0x54/0x80
> Apr 24 17:09:45 vfnphav1a kernel: [340711.080292] [<ffffffff811247c3>]
> do_writepages+0x23/0x40
> Apr 24 17:09:45 vfnphav1a kernel: [340711.086196] [<ffffffff811add39>]
> __writeback_single_inode+0x49/0x2c0
> Apr 24 17:09:45 vfnphav1a kernel: [340711.093131] [<ffffffff81086c8f>] ?
> wake_up_bit+0x2f/0x40
> Apr 24 17:09:45 vfnphav1a kernel: [340711.099028] [<ffffffff811af3b6>]
> writeback_sb_inodes+0x2d6/0x490
> Apr 24 17:09:45 vfnphav1a kernel: [340711.105625] [<ffffffff811af60e>]
> __writeback_inodes_wb+0x9e/0xd0
> Apr 24 17:09:45 vfnphav1a kernel: [340711.112223] [<ffffffff811af83b>]
> wb_writeback+0x1fb/0x320
> Apr 24 17:09:45 vfnphav1a kernel: [340711.118214] [<ffffffff811afa60>]
> wb_do_writeback+0x100/0x210
> Apr 24 17:09:45 vfnphav1a kernel: [340711.124466] [<ffffffff811afbe0>]
> bdi_writeback_workfn+0x70/0x250
> Apr 24 17:09:45 vfnphav1a kernel: [340711.131063] [<ffffffff814954de>] ?
> mutex_unlock+0xe/0x10
> Apr 24 17:09:45 vfnphav1a kernel: [340711.136974] [<ffffffffa02c4ef4>] ?
> bnx2x_release_phy_lock+0x24/0x30 [bnx2x]
> Apr 24 17:09:45 vfnphav1a kernel: [340711.144530] [<ffffffff8106529a>]
> process_one_work+0x13a/0x450
> Apr 24 17:09:45 vfnphav1a kernel: [340711.150872] [<ffffffff810656d2>]
> worker_thread+0x122/0x4f0
> Apr 24 17:09:45 vfnphav1a kernel: [340711.156944] [<ffffffff81086589>] ?
> __wake_up_common+0x59/0x90
> Apr 24 17:09:45 vfnphav1a kernel: [340711.163280] [<ffffffff810655b0>] ?
> process_one_work+0x450/0x450
> Apr 24 17:09:45 vfnphav1a kernel: [340711.169790] [<ffffffff8106a98e>]
> kthread+0xde/0x100
> Apr 24 17:09:45 vfnphav1a kernel: [340711.175253] [<ffffffff81050dc4>] ?
> do_exit+0x6e4/0xaa0
> Apr 24 17:09:45 vfnphav1a kernel: [340711.180987] [<ffffffff8106a8b0>] ?
> __init_kthread_worker+0x40/0x40
> Apr 24 17:09:45 vfnphav1a kernel: [340711.187757] [<ffffffff81498d88>]
> ret_from_fork+0x58/0x90
> Apr 24 17:09:45 vfnphav1a kernel: [340711.193652] [<ffffffff8106a8b0>] ?
> __init_kthread_worker+0x40/0x40
>
> The process resumed "running" after some time, but it's excruciatingly
> slow, with speeds of about 40 KB/s.
> All ceph processes seem to be mostly idle.
>
> From the backtrace I can't rule out a network adapter problem, since
> I see some bnx2x_ locking functions, but the network seems to be
> running fine otherwise and I didn't have any issues until I started
> using RBD heavily.
>
> If I can provide any more information, please let me know.
Can you watch the OSD sockets in netstat for a while and describe what
you are seeing, or forward a few representative samples?
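For example, something along these lines would surface the interesting sockets (the 6800-7300 port range is the Ceph default for OSDs, set by ms_bind_port_min/max; adjust if yours differ):

```shell
# Filter `netstat -tn` output down to sockets in the default OSD port
# range (6800-7300) that have a non-zero Recv-Q or Send-Q; a queue that
# keeps growing points at a stalled messenger connection. The port
# range is an assumption; adjust it for your configuration.
netstat -tn | awk '
    $4 ~ /:(6[89][0-9][0-9]|7[0-2][0-9][0-9]|7300)$/ ||
    $5 ~ /:(6[89][0-9][0-9]|7[0-2][0-9][0-9]|7300)$/ {
        if ($2 > 0 || $3 > 0) print
    }'
```

Running it in a loop (e.g. under `watch -n 2`) while dd is stuck should show whether any particular connection is the one backing up.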
Thanks,
Ilya
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com