Problem: Race in fscache operation enqueuing for reading and copying
multiple pages from cachefiles to netfs causes kernel oops or at worst
memory scribbling.
Symptoms:
May 10 13:49:36 sc-sdgx-208 kernel: [510915.776755] BUG: unable to handle
kernel NULL pointer dereference at 0000000000000008
May 10 13:49:36 sc-sdgx-208 kernel: [510915.776793] IP:
[<ffffffff81095951>] process_one_work+0x31/0x3f0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.776821] PGD 5da5094067 PUD
6caa59d067 PMD0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.776843] Oops: 0000 [#1] SMP
May 10 13:49:36 sc-sdgx-208 kernel: [510915.776858] Modules linked in:
nfsv3 xt_recent veth xt_multiport ipt_MASQUERADE nf_nat_masquerade_ipv4
xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 br_netfilter bridge stp llc
aufs ipmi_devintf cachefiles msr nvidia_drm(POE) nvidia_uvm(POE)
ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6
nf_defrag_ipv6 bonding ipmi_ssif ipt_REJECT nf_reject_ipv4 nf_log_ipv4
nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype mxm_wmi
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables
nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp intel_rapl nf_nat
x86_pkg_temp_thermal intel_powerclamp nf_conntrack_ftp coretemp
nf_conntrack kvm_intel iptable_filter ip_tables kvm x_tables irqbypass
crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul
glue_helper ablk_helper cryptd ast ttm input_leds drm_kms_helper sb_edac
joydev drm edac_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect
sysimgblt mei_me mei lpc_ich shpchp ipmi_si ipmi_msghandler 8250_fintek wmi
acpi_power_meter nvidia_modeset(POE) mac_hid knem(OE) lp parport nfsd
nvidia(POE) auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache rdma_ucm(OE)
ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE)
ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE)
ib_core(OE) ib_addr(OE) ib_netlink(OE) mlx4_en(OE) mlx4_core(OE)
nls_iso8859_1 ses enclosure hid_generic mlx5_core(OE) usbhid hid ixgbe dca
mlx_compat(OE) megaraid_sas vxlan ip6_udp_tunnel udp_tunnel ahci ptp
libahci pps_core mdio fjes
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777450] CPU: 78 PID: 24681
Comm: kworker/u162:7 Tainted: P OE 4.4.0-45-generic
#66~14.04.1-Ubuntu
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777492] Hardware name:
XXXXXXXXXXXX , BIOS S2W_3A01.NVD02 11/03/2016
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777529] task: ffff88192acc0000
ti: ffff882804c44000 task.ti: ffff882804c44000
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777572] RIP:
0010:[<ffffffff81095951>] [<ffffffff81095951>] process_one_work+0x31/0x3f0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777606] RSP:
0000:ffff882804c47e28 EFLAGS: 00010046
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777624] RAX: ffff887dff0b1e00
RBX: ffff887ec7b81a40 RCX: 0000000000000001
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777646] RDX: ffff887dff0b1e00
RSI: ffff887dff0b0240 RDI: ffff887ec7b81a40
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777668] RBP: ffff882804c47e60
R08: 0000000000000101 R09: 00000001802a001b
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777690] R10: 00000000ff0b1e01
R11: ffffea01f7fc2c00 R12: ffff887f657a0000
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777712] R13: 0000000000000000
R14: ffff88192acc0000 R15: ffff887dff0b0240
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777734] FS:
0000000000000000(0000) GS:ffff887f7f580000(0000) knlGS:0000000000000000
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777758] CS: 0010 DS: 0000 ES:
0000 CR0: 0000000080050033
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777777] CR2: 00000000000000b8
CR3: 0000005ef67a2000 CR4: 00000000003406e0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777799] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777820] DR3: 0000000000000000
DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777842] Stack:
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777851] 00000000657a0018
0000000000000000 ffff887f657a0000 ffff887f657a0018
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777877] ffff887ec7b81a70
ffff88192acc0000 ffff887ec7b81a40 ffff882804c47ec8
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777902] ffffffff810961ea
ffffffff817f6659 ffff887f60bf0000 ffff88192acc0000
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777928] Call Trace:
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777942] [<ffffffff810961ea>]
worker_thread+0x11a/0x470
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777963] [<ffffffff817f6659>] ?
__schedule+0x359/0x980
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777982] [<ffffffff810960d0>] ?
rescuer_thread+0x310/0x310
May 10 13:49:36 sc-sdgx-208 kernel: [510915.778003] [<ffffffff8109b959>]
kthread+0xc9/0xe0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.778020] [<ffffffff8109b890>] ?
kthread_park+0x60/0x60
May 10 13:49:36 sc-sdgx-208 kernel: [510915.778039] [<ffffffff817fa88f>]
ret_from_fork+0x3f/0x70
May 10 13:49:36 sc-sdgx-208 kernel: [510915.778056] [<ffffffff8109b890>] ?
kthread_park+0x60/0x60
May 10 13:49:36 sc-sdgx-208 kernel: [510915.778074] Code: 48 89 e5 41 57 49
89 f7 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 10 48 8b 06 4c 8b 67
48 48 89 c2 30 d2 a8 04 4c 0f 45 ea <49> 8b 45 08 44 8b b0 00 01 00 00 41
c1 ee 05 41 83 e6 01 41 f6
May 10 13:49:36 sc-sdgx-208 kernel: [510915.779021] RIP
[<ffffffff81095951>] process_one_work+0x31/0x3f0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.779881] RSP <ffff882804c47e28>
May 10 13:49:36 sc-sdgx-208 kernel: [510915.780787] CR2: 0000000000000008
May 10 13:49:36 sc-sdgx-208 kernel: [510915.784428]
---[ end trace 164b63cc6ad868c8 ]---
Root cause:
Lets Take an example of reading two pages.
Thread1:
nfs_readpages
nfs_readpages_from_fscache
__nfs_readpages_from_fscache
__fscache_read_or_alloc_pages(cookie,mapping,pages, nr_pages,
end_io_func, context , gfp)
op = fscache_alloc_retrieval(cookie, mapping, end_io_func, context)
(op->usage =1)
|cachefiles_read_or_alloc_pages
|cachefiles_read_backing_file
|op->op.processor = cache_files_read_copier
for each netpage in list of netpages
| allocate a monitor(cachefiles_one_read) one for each page (
tracker for netfs page , cachefs_page)
| fscache_get_retreival(op) (op->usage ++)
| find backing page. allocate if not found
| read_backing_page: bmapping->a_ops->readpage(NULL,backpage) -
read the page from disk
| add_page_wait_queue(monitor->monitor, cachefiles_read_waiter)
| fscache_put_retrieval(op) (op->usage --)
So for a simple two page read , op->usage is at 2 (one for each page)
Thread 2
cachefiles_read_waiter (monitor)
fscache_enqueue_retrieval
add the monitor to op->to_do list
##### slight DELAY for second page, so the op->to_do list has two
pages, ####
##### op->usage is at 3 (2 from above + 1 for the first page)
#########
fscache_enqueue_operation
## so by the time the second page is being queued, the op is
freed and usage is at 0. ###
fscache_get_retreival(op) (op->usage ++)
queue_work(fscache_op_wq, &op->work) op->processor is set to
cache_files_read_copier
So for a simple two page read , op->usage is at 4(GOOD case) or 1 (BAD
case)
Thread 3
fscache_op_work_func(op)
cachefiles_read_copier()
### loops two times for two pages as the op->to_do has two pages
while (op->to_do) --> monitor {
copy_highpage(monitor->netfs_page, monitor->back_page);
fscache_mark_page_cached(monitor->op, monitor->netfs_page);
put_page(monitor->back_page);
fscache_end_io(op, monitor->netfs_page, error); --> notify page
copy done to Netfs
put_page(monitor->netfs_page);
fscache_retrieval_complete(op, 1);
fscache_put_retrieval(op); (op->usage--)
kfree(monitor);
}
fscache_put_retrieval(op); (op->usage --)
# if the first callback finishes before the delay, op->usage is at 0
and the memory for OP is freed and potentially reused
# if the first callback pushes into the cache->op_gc_list , we may end
up with ASSERTION FAILED (5 == 6)
# ASSERTIFCMP(op->state != FSCACHE_OP_ST_INITIALISED &&
op->state != FSCACHE_OP_ST_COMPLETE, op->state, ==,
FSCACHE_OP_ST_CANCELLED);
#possible fix:
#clearly adding the page to to_do list in cachefiles_read_copier prior to
bumping the reference is a problem. bump up the op->usage before queuing
under the object. This ensures the op is not freed. (op->usage is atleast
1)
# enqueue the operation for second page. op->usage is 2
# decrement the op->usage so that the second callback frees properly.
static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
int sync, void
*_key)
{
...
...
spin_lock(&object->work_lock);
+ fscache_get_retrieval(monitor->op);
list_add_tail(&monitor->op_link, &monitor->op->to_do);
spin_unlock(&object->work_lock);
fscache_enqueue_retrieval(monitor->op);
+ fscache_put_retrieval(monitor->op);
return 0;
}
Thanks
Kiran
--
Linux-cachefs mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cachefs