Hi Ross,

On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall <ross.lagerw...@citrix.com> wrote:
> Each gfs2_bufdata stores a reference to a glock but the reference count
> isn't incremented. This causes an occasional use-after-free of the
> glock. Fix by taking a reference on the glock during allocation and
> dropping it when freeing.
>
> Found by KASAN:
>
> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
> Write of size 4 at addr ffff88801aff6134 by task kworker/0:2H/20371
>
> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
> Workqueue: glock_workqueue glock_work_func [gfs2]
> Call Trace:
>  dump_stack+0x71/0xab
>  print_address_description+0x6a/0x270
>  kasan_report+0x258/0x380
>  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  gfs2_log_flush+0x511/0xa70 [gfs2]
>  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>  ? __brelse+0x48/0x50
>  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>  ? gfs2_trans_end+0x18d/0x340 [gfs2]
>  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>  ? inode_go_dump+0xe0/0xe0 [gfs2]
>  ? inode_go_sync+0xe4/0x220 [gfs2]
>  inode_go_sync+0xe4/0x220 [gfs2]
>  do_xmote+0x12b/0x290 [gfs2]
>  glock_work_func+0x6f/0x160 [gfs2]
>  process_one_work+0x461/0x790
>  worker_thread+0x69/0x6b0
>  ? process_one_work+0x790/0x790
>  kthread+0x1ae/0x1d0
>  ? kthread_create_worker_on_cpu+0xc0/0xc0
>  ret_from_fork+0x22/0x40
Thanks for tracking this down, very interesting.

The consistency model here is that every buffer head that a struct gfs2_bufdata object is attached to is protected by a glock. Before a glock can be released, all the buffers under that glock have to be flushed out and released; this is what allows another node to access the same on-disk location without causing inconsistencies. When a bufdata object points to a glock that has already been freed, that consistency model is broken. Taking an additional refcount as this patch does may make the use-after-free go away, but it doesn't fix the underlying problem, so I think we'll need a different fix here.

Did you observe this problem in a real-world scenario, or with KASAN only? It might be that we're looking at a small race that is unlikely to trigger in the field. In any case, I think we need to understand better what's actually going on.

Thanks,
Andreas