Hi, the problem is the cleanup of the tokens and/or the openfile objects. I suggest you open a defect for this.

sven

On Thu, Jul 12, 2018 at 8:22 AM Billich Heinrich Rainer (PSI) <[email protected]> wrote:

> Hello Sven,
>
> The machine has
>
> maxFilesToCache 204800 (2M)
>
> it will become a CES node, hence the higher than default value. It’s just
> a 3 node cluster with remote cluster mount and no activity (yet). But all
> three nodes are listed as token server by ‘mmdiag --tokenmgr’.
>
> Top showed 100% idle on core 55. This matches the kernel messages about
> rmmod being stuck on core 55.
>
> I didn’t see a dominating thread/process, but many kernel threads showed
> 30-40% CPU, in sum that used about 50% of all cpu available.
>
> This time mmshutdown did return and left the module loaded, next mmstartup
> tried to remove the ‘old’ module and got stuck :-(
>
> I append two links to screenshots
>
> Thank you,
>
> Heiner
>
> https://pasteboard.co/Hu86DKf.png
> https://pasteboard.co/Hu86rg4.png
>
> If the links don’t work I can post the images to the list.
>
> Kernel messages:
>
> [ 857.791050] CPU: 55 PID: 16429 Comm: rmmod Tainted: G W OEL ------------ 3.10.0-693.17.1.el7.x86_64 #1
> [ 857.842265] Hardware name: HP ProLiant DL380 Gen9/ProLiant DL380 Gen9, BIOS P89 01/22/2018
> [ 857.884938] task: ffff883ffafe8fd0 ti: ffff88342af30000 task.ti: ffff88342af30000
> [ 857.924120] RIP: 0010:[<ffffffff8119202e>] [<ffffffff8119202e>] compound_unlock_irqrestore+0xe/0x20
> [ 857.970708] RSP: 0018:ffff88342af33d38 EFLAGS: 00000246
> [ 857.999742] RAX: 0000000000000000 RBX: ffff88207ffda068 RCX: 00000000000000e5
> [ 858.037165] RDX: 0000000000000246 RSI: 0000000000000246 RDI: 0000000000000246
> [ 858.074416] RBP: ffff88342af33d38 R08: 0000000000000000 R09: 0000000000000000
> [ 858.111519] R10: ffff88207ffcfac0 R11: ffffea00fff40280 R12: 0000000000000200
> [ 858.148421] R13: 00000001fff40280 R14: ffffffff8118cd84 R15: ffff88342af33ce8
> [ 858.185845] FS: 00007fc797d1e740(0000) GS:ffff883fff0c0000(0000) knlGS:0000000000000000
> [ 858.227062] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 858.257819] CR2: 00000000004116d0 CR3: 0000003fc2ec0000 CR4: 00000000001607e0
> [ 858.295143] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 858.332145] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 858.369097] Call Trace:
> [ 858.384829] [<ffffffff816a29f6>] put_compound_page+0x149/0x174
> [ 858.416176] [<ffffffff81192275>] put_page+0x45/0x50
> [ 858.443185] [<ffffffffc09be4ba>] cxiReleaseAndForgetPages+0xda/0x220 [mmfslinux]
> [ 858.481751] [<ffffffffc09beaed>] ? cxiDeallocPageList+0xbd/0x110 [mmfslinux]
> [ 858.518206] [<ffffffffc09bea75>] cxiDeallocPageList+0x45/0x110 [mmfslinux]
> [ 858.554438] [<ffffffff816adfe0>] ? _raw_spin_lock+0x10/0x30
> [ 858.585522] [<ffffffffc09bec6a>] cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
> [ 858.622670] [<ffffffffc0b69982>] kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
> [ 858.659246] [<ffffffffc0b54d15>] mmfs+0xc85/0xca0 [mmfs26]
> [ 858.689379] [<ffffffffc09a3d26>] gpfs_clean+0x26/0x30 [mmfslinux]
> [ 858.722330] [<ffffffffc0c9c945>] cleanup_module+0x25/0x30 [mmfs26]
> [ 858.755431] [<ffffffff8110044b>] SyS_delete_module+0x19b/0x300
> [ 858.786882] [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
> [ 858.818776] Code: 89 ca 44 89 c1 4c 8d 43 10 e8 6f 2b ff ff 89 c2 48 89 13 5b 5d c3 0f 1f 80 00 00 00 00 55 48 89 e5 f0 80 67 03 fe 48 89 f7 57 9d <0f> 1f 44 00 00 5d c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44
> [ 859.068528] hrtimer: interrupt took 2877171 ns
> [ 870.517924] INFO: rcu_sched self-detected stall on CPU { 55} (t=240003 jiffies g=18437 c=18436 q=194992)
> [ 870.577882] Task dump for CPU 55:
> [ 870.602837] rmmod R running task 0 16429 16374 0x00000008
> [ 870.645206] Call Trace:
> [ 870.666388] <IRQ> [<ffffffff810c58a8>] sched_show_task+0xa8/0x110
> [ 870.704271] [<ffffffff810c9309>] dump_cpu_task+0x39/0x70
> [ 870.738421] [<ffffffff811399f0>] rcu_dump_cpu_stacks+0x90/0xd0
> [ 870.775339] [<ffffffff8113d012>] rcu_check_callbacks+0x442/0x730
> [ 870.812353] [<ffffffff810f4ee0>] ? tick_sched_do_timer+0x50/0x50
> [ 870.848875] [<ffffffff8109c076>] update_process_times+0x46/0x80
> [ 870.884847] [<ffffffff810f4ce0>] tick_sched_handle+0x30/0x70
> [ 870.919740] [<ffffffff810f4f19>] tick_sched_timer+0x39/0x80
> [ 870.953660] [<ffffffff810b6864>] __hrtimer_run_queues+0xd4/0x260
> [ 870.989276] [<ffffffff810b6dff>] hrtimer_interrupt+0xaf/0x1d0
> [ 871.023481] [<ffffffff81053a05>] local_apic_timer_interrupt+0x35/0x60
> [ 871.061233] [<ffffffff816bea4d>] smp_apic_timer_interrupt+0x3d/0x50
> [ 871.097838] [<ffffffff816b9d32>] apic_timer_interrupt+0x232/0x240
> [ 871.133232] <EOI> [<ffffffff816a287e>] ? put_page_testzero+0x8/0x15
> [ 871.170089] [<ffffffff816a29fe>] put_compound_page+0x151/0x174
> [ 871.204221] [<ffffffff81192275>] put_page+0x45/0x50
> [ 871.234554] [<ffffffffc09be4ba>] cxiReleaseAndForgetPages+0xda/0x220 [mmfslinux]
> [ 871.275763] [<ffffffffc09beaed>] ? cxiDeallocPageList+0xbd/0x110 [mmfslinux]
> [ 871.316987] [<ffffffffc09bea75>] cxiDeallocPageList+0x45/0x110 [mmfslinux]
> [ 871.356886] [<ffffffff816adfe0>] ? _raw_spin_lock+0x10/0x30
> [ 871.389455] [<ffffffffc09bec6a>] cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
> [ 871.429784] [<ffffffffc0b69982>] kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
> [ 871.468753] [<ffffffffc0b54d15>] mmfs+0xc85/0xca0 [mmfs26]
> [ 871.501196] [<ffffffffc09a3d26>] gpfs_clean+0x26/0x30 [mmfslinux]
> [ 871.536562] [<ffffffffc0c9c945>] cleanup_module+0x25/0x30 [mmfs26]
> [ 871.572110] [<ffffffff8110044b>] SyS_delete_module+0x19b/0x300
> [ 871.606048] [<ffffffff816b89fd>] system_call_fastpath+0x16/0x1b
>
> --
> Paul Scherrer Institut
> Science IT
> Heiner Billich
> WHGA 106
> CH 5232 Villigen PSI
> 056 310 36 02
> https://www.psi.ch
>
> *From: *<[email protected]> on behalf of Sven Oehme <[email protected]>
> *Reply-To: *gpfsug main discussion list <[email protected]>
> *Date: *Thursday 12 July 2018 at 15:42
> *To: *gpfsug main discussion list <[email protected]>
> *Subject: *Re: [gpfsug-discuss] /sbin/rmmod mmfs26 hangs on mmshutdown
>
> if that happens it would be interesting what top reports
>
> start top in a large resolution window (like 330x80), press shift-H,
> this will break it down per thread, also press 1 to have a list of each cpu
> individually and see if you can either spot one core on the top list with
> 0% idle or on the thread list on the bottom if any of the threads run at
> 100% core speed.
>
> attached is a screenshot showing which columns to look at, this system is
> idle, so nothing to see, just to show you where to look
>
> does this machine by any chance have either a large maxFilesToCache or is
> it a token server?
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
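
For anyone who wants to check their own kernel log for the same pattern before opening a defect, here is a minimal sketch in Python. It only scans dmesg-style text for the "self-detected stall" line, the "CPU/PID/Comm" task line, and call-trace frames that belong to a kernel module. The names summarize, SAMPLE and the three regular expressions are hypothetical and not part of any GPFS tooling; the sample lines are copied from the trace quoted above, and other kernel versions may format these messages differently.

```python
import re

# "INFO: rcu_sched self-detected stall on CPU { 55} (...)"
STALL_RE = re.compile(r"self-detected stall on CPU \{\s*(\d+)\}")
# "CPU: 55 PID: 16429 Comm: rmmod ..."
TASK_RE = re.compile(r"CPU: (\d+) PID: (\d+) Comm: (\S+)")
# Frames that carry a module tag, e.g. "cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]"
FRAME_RE = re.compile(r"(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+\s+\[(\w+)\]")


def summarize(log_text):
    """Return stalled CPUs, reported tasks and module frames found in log_text."""
    stalled = sorted({int(m.group(1)) for m in STALL_RE.finditer(log_text)})
    tasks = sorted({(int(cpu), int(pid), comm)
                    for cpu, pid, comm in TASK_RE.findall(log_text)})
    frames = []
    for func, module in FRAME_RE.findall(log_text):
        # Keep first-seen order and drop repeats from later dumps of the same stack.
        if (func, module) not in frames:
            frames.append((func, module))
    return {"stalled_cpus": stalled, "tasks": tasks, "module_frames": frames}


# A few lines taken from the kernel messages in this thread (hypothetical test input).
SAMPLE = """\
[ 857.791050] CPU: 55 PID: 16429 Comm: rmmod Tainted: G W OEL ------------ 3.10.0-693.17.1.el7.x86_64 #1
[ 858.443185] [<ffffffffc09be4ba>] cxiReleaseAndForgetPages+0xda/0x220 [mmfslinux]
[ 858.585522] [<ffffffffc09bec6a>] cxiFreeSharedMemory+0x12a/0x130 [mmfslinux]
[ 858.622670] [<ffffffffc0b69982>] kxFreeAllSharedMemory+0xe2/0x160 [mmfs26]
[ 870.517924] INFO: rcu_sched self-detected stall on CPU { 55} (t=240003 jiffies g=18437 c=18436 q=194992)
"""

if __name__ == "__main__":
    for key, value in summarize(SAMPLE).items():
        print(key, value)
```

In practice you would feed it the saved output of dmesg instead of SAMPLE. If the stalled CPU and the rmmod task line name the same core, and the module frames are all in the shared memory cleanup path as here, that is the information worth attaching to the defect.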
