On Sun, Jan 4, 2026 at 11:42 AM Gao Xiang <[email protected]> wrote:
>
> On 2026/1/4 18:01, Amir Goldstein wrote:
> > [+fsdevel][+overlayfs]
> >
> > On Sun, Jan 4, 2026 at 4:56 AM Gao Xiang <[email protected]> wrote:
> >>
> >> Hi Amir,
> >>
> >> On 2026/1/1 23:52, Amir Goldstein wrote:
> >>> On Wed, Dec 31, 2025 at 9:42 PM Gao Xiang <[email protected]> wrote:
> >>>>
> >>>> Previously, commit d53cd891f0e4 ("erofs: limit the level of fs stacking
> >>>> for file-backed mounts") bumped `s_stack_depth` by one to avoid kernel
> >>>> stack overflow, but it breaks composefs mounts, which sometimes need
> >>>> erofs+ovl^2 (and such setups have already been used in production for
> >>>> quite a long time), since `s_stack_depth` can then be 3 (i.e.,
> >>>> FILESYSTEM_MAX_STACK_DEPTH would need to change from 2 to 3).
> >>>>
> >>>> After a long discussion on GitHub issues [1] about possible solutions,
> >>>> one conclusion is that there is no need to support nested file-backed
> >>>> mounts (especially if FILESYSTEM_MAX_STACK_DEPTH were increased to 3).
> >>>> So let's disallow this right now, since there is always a way to use
> >>>> loopback devices as a fallback.
> >>>>
> >>>> Then, I started to wonder about an alternative EROFS quick fix to
> >>>> address the composefs mounts directly for this cycle: since EROFS is
> >>>> the only fs to support file-backed mounts and other stacked fses will
> >>>> just bump up `FILESYSTEM_MAX_STACK_DEPTH`, just check that
> >>>> `s_stack_depth` != 0 and that the backing inode is not from EROFS
> >>>> instead.
> >>>>
> >>>> At least it works for all known file-backed mount use cases (composefs,
> >>>> containerd, and Android APEX for some Android vendors), and the fix is
> >>>> self-contained.
> >>>>
> >>>> Let's defer increasing FILESYSTEM_MAX_STACK_DEPTH for now.
> >>>>
> >>>> Fixes: d53cd891f0e4 ("erofs: limit the level of fs stacking for
> >>>> file-backed mounts")
> >>>> Closes: https://github.com/coreos/fedora-coreos-tracker/issues/2087 [1]
> >>>> Closes: https://lore.kernel.org/r/CAFHtUiYv4+=+JP_-JjARWjo6OwcvBj1wtYN=z0qxwcpec9s...@mail.gmail.com
> >>>> Cc: Amir Goldstein <[email protected]>
> >>>> Cc: Alexander Larsson <[email protected]>
> >>>> Cc: Christian Brauner <[email protected]>
> >>>> Cc: Miklos Szeredi <[email protected]>
> >>>> Signed-off-by: Gao Xiang <[email protected]>
> >>>> ---
> >>>
> >>> Acked-by: Amir Goldstein <[email protected]>
> >>>
> >>> But you forgot to include details of the stack usage analysis you ran
> >>> with the erofs+ovl^2 setup.
> >>>
> >>> I am guessing people will want to see this information before relaxing
> >>> s_stack_depth in this case.
> >>
> >> Sorry, I didn't check emails these days. I'm not sure whether posting
> >> detailed stack traces is useful; how about adding the following words:
> >
> > Didn't mean detailed stack traces, but you did some tests with the
> > new possible setup and you reached stack usage < 8K so I think this is
>
> The issue is that my limited stress test setup cannot cover
> every case:
>
> - I cannot find a way to reliably trigger direct reclaim from a deep
>   memory allocation; is there any suggestion on this?
>
> - I'm not sure what the preferred way is to evaluate the worst-case
>   stack usage below the block layer, but I guess we should care more
>   about the delta added by one more overlayfs layer?
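>
> To be concrete about the quick-fix check described in the commit
> message above, the idea is roughly the following (a simplified,
> illustrative sketch; the helper name and error code are made up, so
> this is not the literal patch):
>
> 	/*
> 	 * Sketch: refuse a file-backed mount when the backing file sits
> 	 * on a stacked fs (s_stack_depth != 0) or on another EROFS, so
> 	 * file-backed mounts cannot nest and erofs itself no longer has
> 	 * to bump s_stack_depth.
> 	 */
> 	static int erofs_reject_nested_backing(struct file *file)
> 	{
> 		struct super_block *backing_sb = file_inode(file)->i_sb;
>
> 		if (backing_sb->s_stack_depth ||
> 		    backing_sb->s_magic == EROFS_SUPER_MAGIC)
> 			return -EOPNOTSUPP;
> 		return 0;
> 	}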
>
> I can only say that the peak stack usage I've seen from my fsstress
> run on an erofs+ovl^2 setup on x86_64 is < 8K (7184 bytes, though I
> don't think the peak value alone is conclusive). That run exercises
> RW workloads in the upperdir, and for such workloads the stack depth
> isn't affected by FILESYSTEM_MAX_STACK_DEPTH, so I don't see how such
> workloads could be harmful.
>
> Then I manually copied up some files (because I didn't find any
> available tool to stress overlayfs copy-ups), and the stack I could
> see is as follows (I think "ovl_copy_up_" is the only path that does
> copy-ups):
>
>   0)     6688      48   mempool_alloc_slab+0x9/0x20
>   1)     6640      56   mempool_alloc_noprof+0x65/0xd0
>   2)     6584      72   __sg_alloc_table+0x128/0x190
>   3)     6512      40   sg_alloc_table_chained+0x46/0xa0
>   4)     6472      64   scsi_alloc_sgtables+0x91/0x2c0
>   5)     6408      72   sd_init_command+0x263/0x930
>   6)     6336      88   scsi_queue_rq+0x54a/0xb70
>   7)     6248     144   blk_mq_dispatch_rq_list+0x265/0x6c0
>   8)     6104     144   __blk_mq_sched_dispatch_requests+0x399/0x5c0
>   9)     5960      16   blk_mq_sched_dispatch_requests+0x2d/0x70
>  10)     5944      56   blk_mq_run_hw_queue+0x208/0x290
>  11)     5888      96   blk_mq_dispatch_list+0x13f/0x460
>  12)     5792      48   blk_mq_flush_plug_list+0x4b/0x180
>  13)     5744      32   blk_add_rq_to_plug+0x3d/0x160
>  14)     5712     136   blk_mq_submit_bio+0x4f4/0x760
>  15)     5576     120   __submit_bio+0x9b/0x240
>  16)     5456      88   submit_bio_noacct_nocheck+0x271/0x330
>  17)     5368      72   iomap_bio_read_folio_range+0xde/0x1d0
>  18)     5296     112   iomap_read_folio_iter+0x1ee/0x2d0
>  19)     5184     264   iomap_readahead+0xb9/0x290
>  20)     4920      48   xfs_vm_readahead+0x4a/0x70
>  21)     4872     112   read_pages+0x6c/0x1b0
>  22)     4760     104   page_cache_ra_unbounded+0x12c/0x210
>  23)     4656      80   filemap_readahead.isra.0+0x78/0xb0
>  24)     4576     192   filemap_get_pages+0x3a6/0x820
>  25)     4384     376   filemap_read+0xde/0x380
>  26)     4008      32   xfs_file_buffered_read+0xa6/0xd0
>  27)     3976      16   xfs_file_read_iter+0x6a/0xd0
>  28)     3960      48   vfs_iocb_iter_read+0xdb/0x140
>  29)     3912      88   erofs_fileio_rq_submit+0x136/0x190
>  30)     3824     368   z_erofs_runqueue+0x1ce/0x9f0
>  31)     3456     232   z_erofs_readahead+0x16c/0x220
>  32)     3224     112   read_pages+0x6c/0x1b0
>  33)     3112     104   page_cache_ra_unbounded+0x12c/0x210
>  34)     3008      80   filemap_readahead.isra.0+0x78/0xb0
>  35)     2928     192   filemap_get_pages+0x3a6/0x820
>  36)     2736     400   filemap_splice_read+0x12c/0x2f0
>  37)     2336      48   backing_file_splice_read+0x3f/0x90
>  38)     2288     128   ovl_splice_read+0xef/0x170
>  39)     2160     104   splice_direct_to_actor+0xb9/0x260
>  40)     2056      88   do_splice_direct+0x76/0xc0
>  41)     1968     120   ovl_copy_up_file+0x1a8/0x2b0
>  42)     1848     840   ovl_copy_up_one+0x14b0/0x1610
>  43)     1008      72   ovl_copy_up_flags+0xd7/0x110
>  44)      936      56   ovl_open+0x72/0x110
>  45)      880      56   do_dentry_open+0x16c/0x480
>  46)      824      40   vfs_open+0x2e/0xf0
>  47)      784     152   path_openat+0x80a/0x12e0
>  48)      632     296   do_filp_open+0xb8/0x160
>  49)      336      80   do_sys_openat2+0x72/0xf0
>  50)      256      40   __x64_sys_openat+0x57/0xa0
>  51)      216      40   do_syscall_64+0xa4/0x310
>  52)      176     176   entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> And it's still far from overflowing the 16k stack, because the only
> difference is how many (ovl_splice_read + backing_file_splice_read)
> pairs are nested, and each extra layer only takes hundreds of bytes.
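>
> As a back-of-envelope reading of the trace above (not a worst-case
> bound), one extra nested overlayfs layer in this splice path only
> repeats the two frames
>
>     ovl_splice_read (128) + backing_file_splice_read (48) ~= 176 bytes
>
> so going from stack depth 2 to 3 should add on the order of a couple
> hundred bytes on top of the ~7K peak mentioned above, against a 16K
> stack.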
>
> Finally, I used my own rostress tool to stress RO workloads, and the
> deepest stack so far is as below (5456 bytes):
>
>   0)     5456      48   arch_scale_cpu_capacity+0x9/0x30
>   1)     5408      16   cpu_util.constprop.0+0x7e/0xe0
>   2)     5392     392   sched_balance_find_src_group+0x29f/0xd30
>   3)     5000     280   sched_balance_rq+0x1b2/0xf10
>   4)     4720     120   pick_next_task_fair+0x23b/0x7b0
>   5)     4600     104   __schedule+0x2bc/0xda0
>   6)     4496      16   schedule+0x27/0xd0
>   7)     4480      24   io_schedule+0x46/0x70
>   8)     4456     120   blk_mq_get_tag+0x11b/0x280
>   9)     4336      96   __blk_mq_alloc_requests+0x2a1/0x410
>  10)     4240     136   blk_mq_submit_bio+0x59c/0x760
>  11)     4104     120   __submit_bio+0x9b/0x240
>  12)     3984      88   submit_bio_noacct_nocheck+0x271/0x330
>  13)     3896      72   iomap_bio_read_folio_range+0xde/0x1d0
>  14)     3824     112   iomap_read_folio_iter+0x1ee/0x2d0
>  15)     3712     264   iomap_readahead+0xb9/0x290
>  16)     3448      48   xfs_vm_readahead+0x4a/0x70
>  17)     3400     112   read_pages+0x6c/0x1b0
>  18)     3288     104   page_cache_ra_unbounded+0x12c/0x210
>  19)     3184      80   filemap_readahead.isra.0+0x78/0xb0
>  20)     3104     192   filemap_get_pages+0x3a6/0x820
>  21)     2912     376   filemap_read+0xde/0x380
>  22)     2536      32   xfs_file_buffered_read+0xa6/0xd0
>  23)     2504      16   xfs_file_read_iter+0x6a/0xd0
>  24)     2488      48   vfs_iocb_iter_read+0xdb/0x140
>  25)     2440      88   erofs_fileio_rq_submit+0x136/0x190
>  26)     2352     368   z_erofs_runqueue+0x1ce/0x9f0
>  27)     1984     232   z_erofs_readahead+0x16c/0x220
>  28)     1752     112   read_pages+0x6c/0x1b0
>  29)     1640     104   page_cache_ra_unbounded+0x12c/0x210
>  30)     1536      40   force_page_cache_ra+0x96/0xc0
>  31)     1496     192   filemap_get_pages+0x123/0x820
>  32)     1304     376   filemap_read+0xde/0x380
>  33)      928      72   do_iter_readv_writev+0x1b9/0x220
>  34)      856      56   vfs_iter_read+0xde/0x140
>  35)      800      64   backing_file_read_iter+0x193/0x1e0
>  36)      736      56   ovl_read_iter+0x98/0xa0
>  37)      680      72   do_iter_readv_writev+0x1b9/0x220
>  38)      608      56   vfs_iter_read+0xde/0x140
>  39)      552      64   backing_file_read_iter+0x193/0x1e0
>  40)      488      56   ovl_read_iter+0x98/0xa0
>  41)      432     152   vfs_read+0x21a/0x350
>  42)      280      64   __x64_sys_pread64+0x92/0xc0
>  43)      216      40   do_syscall_64+0xa4/0x310
>  44)      176     176   entry_SYSCALL_64_after_hwframe+0x77/0x7f
>
> > something worth mentioning.
> >
> >>
> >> Note: There are some observations while evaluating the erofs + ovl^2
> >> setup with an XFS backing fs:
> >>
> >> - Regular RW workloads traverse only one overlayfs layer regardless
> >>   of the value of FILESYSTEM_MAX_STACK_DEPTH, because `upperdir=`
> >>   cannot point to another overlayfs. Therefore, for pure RW
> >>   workloads, the typical stack is always just:
> >>     overlayfs + upper fs + underlying storage
> >>
> >> - For read-only workloads and the read part of copy-up
> >>   (ovl_splice_read), the difference can lie in how many overlayfs
> >>   layers are nested. The stack looks like either:
> >>     ovl + ovl [+ erofs] + backing fs + underlying storage
> >>   or
> >>     ovl [+ erofs] + ext4/xfs + underlying storage
> >>
> >> - The fs reclaim path should be entered only once, so the writeback
> >>   path will not re-enter.
> >>
> >> Sorry about my English, and I'm not sure if this is enough (e.g. the
> >> FUSE passthrough part). I will wait for your further input (and
> >> other acks) before sending this patch upstream.
> >>
> >
> > I think that most people will have problems understanding this
> > rationale not because of the English, but because of the tech ;)
> > this is a bit too hand wavy IMO.
>
> Honestly, I don't have a better way to describe it; I think we'd
> better just focus on the increment from one more overlayfs:
>
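> To put a rough number on that increment (just reading the RO trace
> above, not a formal worst-case bound), the nested layer repeats
>
>     do_iter_readv_writev (72) + vfs_iter_read (56) +
>     backing_file_read_iter (64) + ovl_read_iter (56) ~= 248 bytes
>
> per extra overlayfs layer in the read path.
>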
OK, but are we talking about one more overlayfs?
This patch is adding just one erofs, so what am I missing?

> FILESYSTEM_MAX_STACK_DEPTH == 2 already works with 8k kernel stacks
> on 32-bit arches, so I don't think raising FILESYSTEM_MAX_STACK_DEPTH
> from 2 to 3, which only adds a few hundred bytes of stack usage for
> the intermediate overlayfs on the 16k kernel stacks of 64-bit arches,
> is harmful (and only RO workloads and copy-ups are affected).
>
> And if a few hundred extra bytes of stack usage can overflow the 16k
> kernel stack, then I think the kernel stack could be overflowed
> randomly almost anywhere in the storage stack, not just because of
> this FILESYSTEM_MAX_STACK_DEPTH change.
>

Fine by me, but does that mean that you only want to allow erofs
backing files with >8K stack size?

Otherwise, I do not follow your argument.

Thanks,
Amir.
