On Wed, Nov 20, 2019 at 7:49 PM Mark Vitale <[email protected]> wrote:
> > The following are the arguments of `fileserver`:
> > -syslog -sync always -p 4 -b 524288 -l 524288 -s 1048576 -vc 4096 -cb
> > 1048576 -vhandle-max-cachesize 32768 -jumbo -udpsize 67108864
> > -sendsize 67108864 -rxmaxmtu 9000 -rxpck 4096 -busyat 65536
>
> I see some areas of concern here.  First of all, many of your parameters
> indicate that you expect to run relatively high load through this fileserver.
> Yet there are only -p 4 server threads defined.  The fileserver will
> automatically increase this to the minimum of 6, but that still seems quite low.
These parameters (at least most of them) were empirically identified for a highly concurrent access pattern: a large number of 16 KiB to 20 MiB files, accessed by a small number of users (2-3) over a low-latency network (wired, Gigabit, same LAN). (I also had an IRC discussion with Jeffrey about this topic.)

There is a thread on this mailing list from 9th March 2019, with the subject <<Questions regarding `afsd` caching arguments (`-dcache` and `-files`)>>, where I've also included that IRC discussion with Jeffrey. The `-p` argument is explicitly covered in that discussion.

The main use-case of my setup is a home / SOHO file server acting as a NAS. Therefore all my parameters are tuned towards low-latency and high-bandwidth access, at the expense of server RAM (hence the large buffer counts and sizes).

> This low thread number, combined with a very large -busyat value,
> means that this fileserver will queue a very large backlog before returning
> VBUSY to the client.  Is there a reason you need to keep the fileserver
> threads so low?  Would it be possible for you to increase it dramatically
> (perhaps 100) and try the test again?

I've just increased this to `-p 128` and re-executed the build. (I haven't restarted the client, but I did restart the server.)

Under the initial parameters (i.e. 8 parallel builds) I wasn't able to replicate the issue in 10 tries. (The solution for this item seemed to be removing `-jumbo` and setting `-rxmaxmtu 1500` instead of `9000`.)

Thus I deleted around ~2K output files and increased the parallelism to 32. Under these conditions, although the build didn't block, the receive bandwidth (over wireless) was around 500 KiB/s when I would have expected more (the input files are much larger than the output files, for instance ~300 KiB in to ~25 KiB out), and the task completion rate seemed very jagged (i.e. no progress for a while, then all of a sudden 10 tasks would finish at once).
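For clarity, the full invocation I'm now testing is the one quoted at the top, with only `-p` raised to 128, `-jumbo` dropped, and `-rxmaxmtu` lowered to 1500; roughly:

```shell
# Revised fileserver arguments under test; everything except -p, -jumbo,
# and -rxmaxmtu is unchanged from my original invocation quoted above.
fileserver \
    -syslog -sync always \
    -p 128 \
    -b 524288 -l 524288 -s 1048576 -vc 4096 -cb 1048576 \
    -vhandle-max-cachesize 32768 \
    -udpsize 67108864 -sendsize 67108864 \
    -rxmaxmtu 1500 -rxpck 4096 \
    -busyat 65536
```

(This is just a restatement of the parameters for clarity, not a recommendation.)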
(I mention that the workload is not CPU bound; average CPU on the client is around ~20%.)

I've tried this second scenario (with the no-jumbo settings) a few times and still nothing got stuck. However, even though the case of "process stuck for 20 minutes" seems solved, there is still the issue that trying to `SIGTERM` those waiting processes pushes the kernel to 100% CPU.

If I can try other experiments, please let me know.

Thanks,
Ciprian.

_______________________________________________
OpenAFS-info mailing list
[email protected]
https://lists.openafs.org/mailman/listinfo/openafs-info
