Hi Ross,

On Thu, 31 Jan 2019 at 11:56, Ross Lagerwall <ross.lagerw...@citrix.com> wrote:
> Each gfs2_bufdata stores a reference to a glock but the reference count
> isn't incremented. This causes an occasional use-after-free of the
> glock. Fix by taking a reference on the glock during allocation and
> dropping it when freeing.
> Found by KASAN:
> BUG: KASAN: use-after-free in revoke_lo_after_commit+0x8e/0xe0 [gfs2]
> Write of size 4 at addr ffff88801aff6134 by task kworker/0:2H/20371
> CPU: 0 PID: 20371 Comm: kworker/0:2H Tainted: G O 4.19.0+0 #1
> Hardware name: Dell Inc. PowerEdge R805/0D456H, BIOS 4.2.1 04/14/2010
> Workqueue: glock_workqueue glock_work_func [gfs2]
> Call Trace:
>  dump_stack+0x71/0xab
>  print_address_description+0x6a/0x270
>  kasan_report+0x258/0x380
>  ? revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  revoke_lo_after_commit+0x8e/0xe0 [gfs2]
>  gfs2_log_flush+0x511/0xa70 [gfs2]
>  ? gfs2_log_shutdown+0x1f0/0x1f0 [gfs2]
>  ? __brelse+0x48/0x50
>  ? gfs2_log_commit+0x4de/0x6e0 [gfs2]
>  ? gfs2_trans_end+0x18d/0x340 [gfs2]
>  gfs2_ail_empty_gl+0x1ab/0x1c0 [gfs2]
>  ? inode_go_dump+0xe0/0xe0 [gfs2]
>  ? inode_go_sync+0xe4/0x220 [gfs2]
>  inode_go_sync+0xe4/0x220 [gfs2]
>  do_xmote+0x12b/0x290 [gfs2]
>  glock_work_func+0x6f/0x160 [gfs2]
>  process_one_work+0x461/0x790
>  worker_thread+0x69/0x6b0
>  ? process_one_work+0x790/0x790
>  kthread+0x1ae/0x1d0
>  ? kthread_create_worker_on_cpu+0xc0/0xc0
>  ret_from_fork+0x22/0x40

Thanks for tracking this down, very interesting.

The consistency model here is that every buffer head that a struct
gfs2_bufdata object is attached to is protected by a glock. Before a
glock can be released, all the buffers under that glock have to be
flushed out and released; this is what allows another node to access
the same on-disk location without causing inconsistencies. When there
is a bufdata object that points to a glock that has already been
freed, this consistency model is broken. Taking an additional refcount
as this patch does may make the use-after-free go away, but it doesn't
fix the underlying problem. So I think we'll need a different fix.
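
To make that invariant concrete, here is a toy userspace sketch in C.
It is not kernel code: every toy_* name and field is made up for
illustration, and nr_attached is an assumed stand-in for "buffers still
under this glock". It only shows that holding an extra reference keeps
the memory alive without making it safe to give up the glock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; these are NOT the gfs2 structures. */
struct toy_glock {
	int refcount;     /* holders keeping the object's memory alive */
	int nr_attached;  /* bufdata objects still pointing at this glock */
};

struct toy_bufdata {
	struct toy_glock *gl;  /* back-pointer, meaningful only while attached */
};

/* Take an extra reference (roughly what the proposed patch adds). */
static void toy_glock_get(struct toy_glock *gl)
{
	gl->refcount++;
}

/* Drop a reference; the count must never go negative. */
static void toy_glock_put(struct toy_glock *gl)
{
	assert(gl->refcount > 0);
	gl->refcount--;
}

/* Attach a bufdata to a glock and account for it. */
static void toy_bd_attach(struct toy_bufdata *bd, struct toy_glock *gl)
{
	bd->gl = gl;
	gl->nr_attached++;
}

/* Detach after the buffer has been flushed out and released. */
static void toy_bd_detach(struct toy_bufdata *bd)
{
	if (bd->gl) {
		bd->gl->nr_attached--;
		bd->gl = NULL;
	}
}

/*
 * The consistency rule: the glock may be released to another node only
 * once nothing is attached, regardless of how many references exist.
 */
static bool toy_glock_may_release(const struct toy_glock *gl)
{
	return gl->nr_attached == 0;
}
```

In this model, calling toy_glock_get() while a bufdata is attached
raises refcount but toy_glock_may_release() still returns false; only
toy_bd_detach() changes that.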

Did you observe this problem in a real-world scenario, or with KASAN
only? It might be that we're looking at a small race that is unlikely
to trigger in the field. In any case, I think we need to understand
better what's actually going on.

