[Expired for linux (Ubuntu) because there has been no activity for 60
days.]

** Changed in: linux (Ubuntu)
       Status: Incomplete => Expired

-- 
You received this bug notification because you are a member of Kernel
Packages, which is subscribed to linux in Ubuntu.
https://bugs.launchpad.net/bugs/1599681

Title:
  ceph-osd process hung and blocked ps listings

Status in ceph package in Ubuntu:
  Expired
Status in linux package in Ubuntu:
  Expired

Bug description:
  Over the past couple of days, two different ceph-osd nodes crashed in
  such a way that ps listings hung when enumerating the affected
  process.  Both had a call trace associated with them:

  Node 1:
  Jul  4 07:46:15 provider-cs-03 kernel: [4188396.493011] ceph-osd        D ffff882029a67b90     0  5312      1 0x00000004
  Jul  4 07:46:15 provider-cs-03 kernel: [4188396.590564]  ffff882029a67b90 ffff881037cb8000 ffff8820284f3700 ffff882029a68000
  Jul  4 07:46:16 provider-cs-03 kernel: [4188396.688603]  ffff88203296e5a8 ffff88203296e5c0 0000000000000015 ffff8820284f3700
  Jul  4 07:46:16 provider-cs-03 kernel: [4188396.789329]  ffff882029a67ba8 ffffffff817ec495 ffff8820284f3700 ffff882029a67bf8
  Jul  4 07:46:16 provider-cs-03 kernel: [4188396.891376] Call Trace:
  Jul  4 07:46:16 provider-cs-03 kernel: [4188396.939271]  [<ffffffff817ec495>] schedule+0x35/0x80
  Jul  4 07:46:16 provider-cs-03 kernel: [4188396.989957]  [<ffffffff817eeb6a>] rwsem_down_read_failed+0xea/0x120
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.041502]  [<ffffffff813dbd84>] call_rwsem_down_read_failed+0x14/0x30
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.092616]  [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.141510]  [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.189877]  [<ffffffff817ee1f0>] ? down_read+0x20/0x30
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.237513]  [<ffffffff81067f18>] __do_page_fault+0x398/0x430
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.285588]  [<ffffffff81067fd2>] do_page_fault+0x22/0x30
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.332936]  [<ffffffff817f1e78>] page_fault+0x28/0x30
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.379495]  [<ffffffff813dbf15>] ? __clear_user+0x25/0x50
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.426400]  [<ffffffff81039c58>] copy_fpstate_to_sigframe+0x118/0x1d0
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.474904]  [<ffffffff8102d1fd>] get_sigframe.isra.7.constprop.9+0x12d/0x150
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.563204]  [<ffffffff8102d698>] do_signal+0x1e8/0x6d0
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.609783]  [<ffffffff816d19f2>] ? __sys_sendmsg+0x42/0x80
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.656633]  [<ffffffff811b2ed0>] ? handle_mm_fault+0x250/0x540
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.703785]  [<ffffffff8107884c>] exit_to_usermode_loop+0x59/0xa2
  Jul  4 07:46:17 provider-cs-03 kernel: [4188397.751367]  [<ffffffff81003a6e>] syscall_return_slowpath+0x4e/0x60
  Jul  4 07:46:17 provider-cs-03 kernel: [4188397.799369]  [<ffffffff817efe58>] int_ret_from_sys_call+0x25/0x8f
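
  A trace like the Node 1 one is easier to compare against other reports
  once it is reduced to a list of symbols. The following is a minimal
  sketch, not part of the original report: parse_frames and FRAME_RE are
  hypothetical names, and the pattern only covers the "[<address>]
  symbol+0xoff/0xsize" frame format seen above. Frames printed with "?"
  are addresses the kernel found on the stack without being able to
  confirm them as part of the call chain, so they are flagged as
  unreliable.

```python
import re

# Matches one frame such as "[<ffffffff817ec495>] schedule+0x35/0x80".
# \s+ also spans the line breaks that mail wrapping may have introduced.
FRAME_RE = re.compile(
    r"\[<([0-9a-f]{16})>\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)

def parse_frames(log_text):
    """Return the stack frames found in a kernel log excerpt, in order."""
    frames = []
    for addr, unreliable, symbol, offset, size in FRAME_RE.findall(log_text):
        frames.append({
            "address": int(addr, 16),
            "symbol": symbol,
            "offset": int(offset, 16),
            "size": int(size, 16),
            # "?" means the unwinder could not verify this frame.
            "reliable": not unreliable,
        })
    return frames

# Three frames copied from the Node 1 trace, wrapped as in the mail.
sample = """\
  Jul  4 07:46:16 provider-cs-03 kernel: [4188396.939271]  [<ffffffff817ec495>]
schedule+0x35/0x80
  Jul  4 07:46:16 provider-cs-03 kernel: [4188396.989957]  [<ffffffff817eeb6a>]
rwsem_down_read_failed+0xea/0x120
  Jul  4 07:46:16 provider-cs-03 kernel: [4188397.092616]  [<ffffffff813dbf15>]
? __clear_user+0x25/0x50
"""

frames = parse_frames(sample)
reliable_chain = [f["symbol"] for f in frames if f["reliable"]]
```

  Applied to the full Node 1 trace, the reliable frames read from
  do_signal through copy_fpstate_to_sigframe and a page fault into
  rwsem_down_read_failed, i.e. the task is sleeping on a read-write
  semaphore taken in the page-fault path.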

  Node 2:
  [733869.727139] CPU: 17 PID: 1735127 Comm: ceph-osd Not tainted 4.4.0-15-generic #31-Ubuntu
  [733869.796954] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.3.6 06/03/2015
  [733869.927182] task: ffff881841dc6e00 ti: ffff8810cc0a0000 task.ti: ffff8810cc0a0000
  [733870.059139] RIP: 0010:[<ffffffff810b479d>]  [<ffffffff810b479d>] task_numa_find_cpu+0x2cd/0x710
  [733870.192753] RSP: 0000:ffff8810cc0a3bd8  EFLAGS: 00010257
  [733870.260298] RAX: 0000000000000000 RBX: ffff8810cc0a3c78 RCX: 0000000000000012
  [733870.389322] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8810210a0e00
  [733870.517883] RBP: ffff8810cc0a3c40 R08: 0000000000000006 R09: 000000000000013e
  [733870.646335] R10: 00000000000003b4 R11: 000000000000001f R12: ffff881018118000
  [733870.774514] R13: 0000000000000006 R14: ffff8810210a0e00 R15: 0000000000000379
  [733870.902262] FS:  00007fdcfab03700(0000) GS:ffff88203e600000(0000) knlGS:0000000000000000
  [733871.031347] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [733871.097820] CR2: 00007fdcfab02c20 CR3: 0000001029204000 CR4: 00000000001406e0
  [733871.223381] Stack:
  [733871.282453]  ffff8810cc0a3c40 ffffffff811f04ce ffff88102f6e9680 0000000000000012
  [733871.404947]  0000000000000077 000000000000008f 0000000000016d40 0000000000000006
  [733871.527250]  ffff881841dc6e00 ffff8810cc0a3c78 00000000000001ac 00000000000001b8
  [733871.649648] Call Trace:
  [733871.707884]  [<ffffffff811f04ce>] ? migrate_page_copy+0x21e/0x530
  [733871.770946]  [<ffffffff810b501e>] task_numa_migrate+0x43e/0x9b0
  [733871.832808]  [<ffffffff811c9700>] ? page_add_anon_rmap+0x10/0x20
  [733871.893897]  [<ffffffff810b5609>] numa_migrate_preferred+0x79/0x80
  [733871.954283]  [<ffffffff810b9c24>] task_numa_fault+0x7f4/0xd40
  [733872.013128]  [<ffffffff811bdf90>] handle_mm_fault+0xbc0/0x1820
  [733872.071309]  [<ffffffff81101420>] ? do_futex+0x120/0x500
  [733872.128149]  [<ffffffff812288c5>] ? __fget_light+0x25/0x60
  [733872.184044]  [<ffffffff8106a537>] __do_page_fault+0x197/0x400
  [733872.239300]  [<ffffffff8106a7c2>] do_page_fault+0x22/0x30
  [733872.293001]  [<ffffffff81824178>] page_fault+0x28/0x30
  [733872.345187] Code: d0 4c 89 f7 e8 95 c7 ff ff 49 8b 84 24 d8 01 00 00 49 8b 76 78 31 d2 49 0f af 86 b0 00 00 00 4c 8b 45 d0 48 8b 4d b0 48 83 c6 01 <48> f7 f6 4c 89 c6 48 89 da 48 8d 3c 01 48 29 c6 e8 de c5 ff ff
  [733872.507088] RIP  [<ffffffff810b479d>] task_numa_find_cpu+0x2cd/0x710
  [733872.559965]  RSP <ffff8810cc0a3bd8>
  [733872.673773] ---[ end trace aec37273a19e57dc ]---
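
  The "Code:" line of the Node 2 oops marks the byte at the faulting RIP
  with angle brackets, which allows a first guess at the failing
  instruction without the kernel image. The sketch below (faulting_bytes
  is a hypothetical helper, not an existing tool) splits the dump at that
  marker. By hand decoding, which is my reading rather than disassembler
  output, the bytes 48 83 c6 01 before the marker are "add $0x1,%rsi" and
  the marked bytes 48 f7 f6 are "div %rsi". With RSI shown as 0 in the
  register dump, that would be consistent with a divide by zero in
  task_numa_find_cpu, although the first line of the oops, which would
  confirm it, is not part of the report.

```python
def faulting_bytes(code_line):
    """Split an oops 'Code:' line at the <xx> marker.

    Returns (offset, bytes_before, bytes_from_fault), where offset is the
    number of dumped bytes that precede the faulting instruction pointer.
    """
    tokens = code_line.split("Code:", 1)[1].split()
    marked = [i for i, tok in enumerate(tokens) if tok.startswith("<")]
    if len(marked) != 1:
        raise ValueError("expected exactly one <..> marker")
    idx = marked[0]
    raw = bytes(int(tok.strip("<>"), 16) for tok in tokens)
    return idx, raw[:idx], raw[idx:]

# The Code: line from the Node 2 oops, wrapped as in the mail.
code_line = """\
  [733872.345187] Code: d0 4c 89 f7 e8 95 c7 ff ff 49 8b 84 24 d8 01 00 00 49
8b 76 78 31 d2 49 0f af 86 b0 00 00 00 4c 8b 45 d0 48 8b 4d b0 48 83 c6 01 <48>
f7 f6 4c 89 c6 48 89 da 48 8d 3c 01 48 29 c6 e8 de c5 ff ff
"""

offset, before, at_fault = faulting_bytes(code_line)
print(offset, before[-4:].hex(), at_fault[:3].hex())
```

  Feeding the full byte string to a disassembler, or running objdump
  against the matching vmlinux with debug symbols, would give a verified
  decoding.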

  In the ceph logs for node 1 there is:

  ./include/interval_set.h: 340: FAILED assert(0)

   ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
   1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0x56042ebdeceb]
   2: (()+0x4892b8) [0x56042e9512b8]
   3: (boost::statechart::simple_state<ReplicatedPG::WaitingOnReplicas, ReplicatedPG::SnapTrimmer, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na>, (boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base const&, void const*)+0xb2) [0x56042e97a8d2]
   4: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_queued_events()+0x127) [0x56042e9646e7]
   5: (boost::statechart::state_machine<ReplicatedPG::SnapTrimmer, ReplicatedPG::NotTrimming, std::allocator<void>, boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base const&)+0x84) [0x56042e9648b4]
   6: (ReplicatedPG::snap_trimmer()+0x52c) [0x56042e8eb5dc]
   7: (OSD::SnapTrimWQ::_process(PG*)+0x1a) [0x56042e7807da]
   8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x56042ebcf8d6]
   9: (ThreadPool::WorkThread::entry()+0x10) [0x56042ebd0980]
   10: (()+0x8184) [0x7f27ecc66184]
   11: (clone()+0x6d) [0x7f27eb1d137d]
   NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

  Unfortunately the only way we could get the processes to respond again
  was to reboot the systems.

  Is there any way of figuring out what went wrong here?
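
  One way to gather evidence before rebooting, when ps itself hangs, is
  to read task states directly from /proc. The sketch below is an
  illustration under assumptions, not a tested procedure for this bug:
  parse_stat_line and blocked_tasks are hypothetical names, and it
  assumes that /proc/<pid>/task/<tid>/stat stays readable even when the
  command-line lookup that ps performs blocks, which was not verified on
  the affected kernels. It lists every thread in state "D"
  (uninterruptible sleep), the state shown for ceph-osd in the Node 1
  trace.

```python
import os

def parse_stat_line(text):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the first '(' and the last ')'.
    open_idx = text.index("(")
    close_idx = text.rindex(")")
    pid = int(text[:open_idx])
    comm = text[open_idx + 1:close_idx]
    state = text[close_idx + 1:].split()[0]
    return pid, comm, state

def blocked_tasks(proc_root="/proc"):
    """Return (tid, comm) for every thread currently in state 'D'."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        task_dir = os.path.join(proc_root, entry, "task")
        try:
            tids = os.listdir(task_dir)
        except OSError:
            continue  # process exited while we were scanning
        for tid in tids:
            try:
                with open(os.path.join(task_dir, tid, "stat")) as fh:
                    pid, comm, state = parse_stat_line(fh.read())
            except (OSError, ValueError, IndexError):
                continue  # thread exited or line was not parseable
            if state == "D":
                blocked.append((pid, comm))
    return blocked
```

  Calling blocked_tasks() on a Linux host returns the thread IDs to look
  at; for each of them, /proc/<tid>/stack (readable by root on kernels
  built with stack tracing) shows the kernel stack without waiting for
  the hung-task message in the log.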

  $ lsb_release -rd
  Description:  Ubuntu 14.04.4 LTS
  Release:      14.04

  $ dpkg-query -W ceph
  ceph  0.94.7-0ubuntu0.15.04.1~cloud0

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1599681/+subscriptions

-- 
Mailing list: https://launchpad.net/~kernel-packages
Post to     : kernel-packages@lists.launchpad.net
Unsubscribe : https://launchpad.net/~kernel-packages
More help   : https://help.launchpad.net/ListHelp
