On Tue, Nov 19, 2019 at 5:10 PM Ciprian Dorin Craciun <[email protected]> wrote:
> > # echo t > /proc/sysrq-trigger
At the following link you can find an extract of `dmesg` after the sysrq trigger:

https://scratchpad.volution.ro/ciprian/f89fc32a0bbd0ae6d6f3edbbc3ee111c/b9c3bc4f795bbe9e7eaca93b0a57bea0.txt

(I have filtered out the processes that don't have `afs` in their name, mainly because the full output exposes all of my workstation's processes. However, I can provide the complete file privately.)

The following is the process which gets stuck (it took almost ~25 minutes to complete, and it is not related to the input file):

~~~~
gm              S    0 27572  27562 0x80000000
Call Trace:
 ? __schedule+0x2be/0x6d0
 schedule+0x39/0xa0
 afs_cv_wait+0x10a/0x300 [libafs]
 ? wake_up_q+0x60/0x60
 rxi_WriteProc+0x21d/0x410 [libafs]
 ? rxfs_storeUfsWrite+0x55/0xb0 [libafs]
 ? afs_GenericStoreProc+0x11a/0x1f0 [libafs]
 ? afs_CacheStoreDCaches+0x1a9/0x5b0 [libafs]
 ? afs_CacheStoreVCache+0x32c/0x680 [libafs]
 ? __filemap_fdatawrite_range+0xca/0x100
 ? afs_osi_Wakeup+0xb/0x60 [libafs]
 ? afs_UFSGetDSlot+0xf6/0x4f0 [libafs]
 ? afs_StoreAllSegments+0x725/0xc20 [libafs]
 ? afs_linux_flush+0x486/0x4e0 [libafs]
 ? filp_close+0x32/0x70
 ? __x64_sys_close+0x1e/0x50
 ? do_syscall_64+0x6e/0x200
 ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
~~~~

On a second try (which also locks up), the following is the stack trace (only for the blocked process); the two traces look almost identical:

~~~~
gm              S    0 30548  30545 0x80004000
Call Trace:
 ? __schedule+0x2be/0x6d0
 schedule+0x39/0xa0
 afs_cv_wait+0x10a/0x300 [libafs]
 ? wake_up_q+0x60/0x60
 rxi_WriteProc+0x21d/0x410 [libafs]
 ? rxfs_storeUfsWrite+0x55/0xb0 [libafs]
 ? afs_GenericStoreProc+0x11a/0x1f0 [libafs]
 ? afs_CacheStoreDCaches+0x1a9/0x5b0 [libafs]
 ? afs_CacheStoreVCache+0x32c/0x680 [libafs]
 ? __filemap_fdatawrite_range+0xca/0x100
 ? afs_osi_Wakeup+0xb/0x60 [libafs]
 ? afs_UFSGetDSlot+0xf6/0x4f0 [libafs]
 ? afs_StoreAllSegments+0x725/0xc20 [libafs]
 ? afs_linux_flush+0x486/0x4e0 [libafs]
 ? filp_close+0x32/0x70
 ? __x64_sys_close+0x1e/0x50
 ? do_syscall_64+0x6e/0x200
 ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
~~~~

I can reliably trigger the issue almost 50% of the time, by just doing the following:

* remove a few files (in my case ~15), which should trigger the rebuild of around twice as many;
* start the build with a maximum concurrency of 8 processes;
* all the processes execute similar jobs, with similarly sized inputs, outputs, and CPU time used.

Based on `htop`, I would say that neither `ninja` (which does the heavy `stat`-ing) nor `gm` (an ImageMagick alternative) is multi-threaded.

The build procedure involves the following AFS-related operations:

* check if the output exists, and if so `rm` it;
* create an `output.tmp` file;
* move `output.tmp` to `output`.

No other processes are actively using AFS (except `mc` and a couple of `bash` instances which have their `cwd` inside an AFS volume). (The `[nodaemon]` process is a simple tool that uses `prctl(PR_SET_CHILD_SUBREAPER)` to catch double-forking processes, and also has its `cwd` inside AFS.)

Hope it helps,
Ciprian.

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
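P.S. For reference, the per-output sequence described above boils down to a plain write-then-rename; a minimal sketch is below (the file name and contents are hypothetical, the real jobs are `gm` invocations driven by `ninja`):

```shell
#!/bin/sh
# Sketch of the per-job output handling described above; the path and
# payload are made up for illustration, only the operation sequence matters.
set -e

output="output.png"

# 1. if a stale output exists, remove it first;
if [ -e "$output" ] ; then
    rm -- "$output"
fi

# 2. write the new result into a temporary file on the same (AFS) volume;
printf 'new contents\n' > "$output.tmp"

# 3. move the temporary file over the final name (the `close()` of the
#    temporary file is where the blocked trace above sits, inside
#    `afs_linux_flush` -> `afs_StoreAllSegments`).
mv -- "$output.tmp" "$output"
```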
