Re: [lustre-discuss] kernel threads for rpcs in flight

Andreas Dilger via lustre-discuss Thu, 02 May 2024 22:26:53 -0700

On May 2, 2024, at 18:10, Anna Fuchs wrote:
The number of ptlrpcd threads per CPT is set by the 
"ptlrpcd_partner_group_size" module parameter, and defaults to 2 threads per 
CPT, IIRC.  I don't think that clients dynamically start/stop ptlrpcd threads 
at runtime.
When there are RPCs in the queue for any ptlrpcd it will be woken up and 
scheduled by the kernel, so it will compete with the application threads.  
IIRC, if a ptlrpcd thread is woken up and there are no RPCs in the local CPT 
queue it will try to steal RPCs from another CPT on the assumption that the 
local CPU is not generating any RPCs so it would be beneficial to offload 
threads on another CPU that *is* generating RPCs.  If the application thread is 
extremely CPU hungry, then the kernel will not schedule the ptlrpcd threads on 
those cores very often, and the "idle" core ptlrpcd threads will be able to 
run more frequently.


Sorry, maybe I am confusing things. I am still not sure how many threads I get.
For example, I have a 32-core AMD EPYC machine as a client and I am running a 
serial stream IO application with a single stripe, 1 OST.
I am struggling to find out how many CPU partitions I have - is it something 
on the hardware side or something configurable?
There is no file /proc/sys/lnet/cpu_partitions on my client.

This is a module parameter, since it cannot be changed at runtime.  This is 
visible at /sys/module/libcfs/parameters/cpu_npartitions and the default value 
depends on the number of CPU cores and NUMA configuration.  It can be specified 
with "options libcfs cpu_npartitions=" in /etc/modprobe.d/lustre.conf.

Assuming I had 2 CPU partitions, that would result in 4 ptlrpcd threads at 
system start, right?

Correct.
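
As a rough illustration of the arithmetic above, the following Python sketch 
reads cpu_npartitions from the sysfs path mentioned earlier and multiplies it 
by the threads-per-CPT figure.  The helper names (read_module_param, 
expected_ptlrpcd_threads) are made up for this example, the default of 2 
threads per CPT is the "IIRC" figure from this thread rather than a verified 
constant, and treating a value of 0 as "chosen automatically" is an assumption.

```python
from pathlib import Path


def read_module_param(module, name, sysfs_root="/sys/module"):
    """Return a kernel module parameter as a stripped string, or None if absent."""
    path = Path(sysfs_root) / module / "parameters" / name
    try:
        return path.read_text().strip()
    except OSError:
        return None


def expected_ptlrpcd_threads(npartitions, threads_per_cpt=2):
    """Startup thread count under the rule discussed above: CPTs x threads per CPT.

    threads_per_cpt=2 is the default quoted (with an "IIRC") in this thread.
    """
    return npartitions * threads_per_cpt


if __name__ == "__main__":
    raw = read_module_param("libcfs", "cpu_npartitions")
    if raw is None or not raw.isdigit():
        print("cpu_npartitions not readable or not a plain count (is libcfs loaded?)")
    elif int(raw) == 0:
        # Assumption: 0 means the partition count was chosen automatically.
        print("cpu_npartitions is 0; assuming the count was chosen automatically")
    else:
        n = int(raw)
        print(n, "CPTs ->", expected_ptlrpcd_threads(n), "ptlrpcd threads")
```

With 2 CPTs this gives the 4 threads confirmed above; comparing the result 
with the ptlrpcd_* kernel threads actually listed by "ps" is a quick sanity 
check of the assumption.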

Now I set rpcs_in_flight to 1 or to 8, what effect does that have on the 
number and the activity of the threads?

Setting rpcs_in_flight has no effect on the number of ptlrpcd threads.  The 
ptlrpcd threads process RPCs asynchronously (unlike server threads) so they can 
keep many RPCs in progress.

With a serial stream and rpcs_in_flight=1, is only one ptlrpcd thread woken 
up while the other 3 stay asleep and do nothing?

This depends.  There are two ptlrpcd threads for the CPT that can process the 
RPCs from the one user thread.  If they can send the RPCs quickly enough then 
the other ptlrpcd threads may not steal the RPCs from that CPT.

That said, even a single threaded userspace writer may have up to 8 RPCs in 
flight *per OST* (depending on the file striping and if IO submission allows it 
- buffered or AIO+DIO) so if there are a lot of outstanding RPCs and RPC 
generation takes a long time (e.g. compression) then it may be that all ptlrpcd 
threads will be busy.

That does not seem to be the case: I've applied the rpctracing (thanks a lot 
for the hint!!), and with rpcs_in_flight set to 1 it still shows at least 3 
different threads from at least 2 different partitions when writing a 1MB file 
with ten blocks.
I don't get the relationship between these values.

What are the opcodes from the different RPCs?  The ptlrpcd threads are only 
handling asynchronous RPCs like buffered writes, statfs, and a few others.  
Many RPCs are processed in the context of the application thread, not by 
ptlrpcd.

And, if I had compression or any other heavy load, which settings could clearly 
control how many resources I want to give Lustre for this load? I can see a 
clear scaling with higher rpcs in flight, but I am struggling to understand 
the numbers and attribute them to specific settings. The uncompressed case 
already benefits a bit from a higher number of RPCs due to multiple 
"substreaming", but there must be much more happening in parallel behind the 
scenes for the compressed case even with rpcs_in_flight=1.

The "cpu_npartitions" module parameter controls how many groups the cores are 
split into.  The "cpu_pattern" parameter can control the specific cores in each 
of the CPTs, which would affect the default per-CPT ptlrpcd threads location. 
It is possible to further use the "ptlrpcd_cpts" and "ptlrpcd_per_cpt_max" 
parameters to control specifically which cores are used for the threads.

It is entirely possible that the number of ptlrpcd threads and CPT 
configuration is becoming sub-optimal as the number of multi-chip package CPUs 
with many cores grows dramatically.  It is a balance between having enough 
threads to maximize performance without having so many that it goes downhill 
again.  Ideally this should all happen without the need to hand-tune the CPT 
and thread count for every CPU on the market.

Cheers, Andreas

Thank you!

Anna


Whether this behavior is optimal or not is subject to debate, and 
investigation/improvements are of course welcome.  Definitely, data checksums 
have some overhead (a few percent), and client-side data compression (which is 
done by ptlrpcd threads) would have a significant usage of CPU cycles, but 
given the large number of CPU cores on client nodes these days this may still 
provide a net performance benefit if the IO bottleneck is on the server.

With max_rpcs_in_flight = 1, multiple cores are loaded, presumably alternately, 
but the statistics are too inaccurate to capture this.  The distribution of 
threads to cores is regulated by the Linux kernel, right? Does anyone have 
experience with what happens when all CPUs are under full load with the 
application or something else?

Note that {osc,mdc}.*.max_rpcs_in_flight is a *per target* parameter, so a 
single client can still have tens or hundreds of RPCs in flight to different 
servers.  The client will send many RPC types directly from the process 
context, since they are waiting on the result anyway.  For asynchronous bulk 
RPCs, the ptlrpcd thread will try to process the bulk IO on the same CPT (= 
Lustre CPU Partition Table, roughly aligned to NUMA nodes) as the userspace 
application was running when the request was created.  This minimizes the 
cross-NUMA traffic when accessing pages for bulk RPCs, so long as those cores 
are not busy with userspace tasks.  Otherwise, the ptlrpcd thread on another 
CPT will steal RPCs from the queues.

Do the Lustre threads suffer? Is there a prioritization of the Lustre threads 
over other tasks?

Are you asking about the client or the server?  Many of the client RPCs are 
generated by the client threads, but the running ptlrpcd threads do not 
have a higher priority than client application threads.  If the application 
threads are running on some cores, but other cores are idle, then the ptlrpcd 
threads on other cores will try to process the RPCs to allow the application 
threads to continue running there.  Otherwise, if all cores are busy (as is 
typical for HPC applications) then they will be scheduled by the kernel as 
needed.

Are there readily available statistics or tools for this scenario?

What statistics are you looking for?  There are "{osc,mdc}.*.stats" and 
"{osc,mdc}.*.rpc_stats" that have aggregate information about RPC counts and 
latency.

Oh, right, these tell a lot. Isn't there also something to log the utilization 
and location of these threads? Otherwise, I'll continue trying with perf, which 
seems to be more complex with kernel threads.

There are kernel debug logs available when "lctl set_param debug=+rpctrace" is 
enabled, which show which ptlrpcd or application thread is handling each 
RPC, and on which core it ran.  These can be found on the client by 
searching for "Sending RPC|Completed RPC" in the debug logs, for example:

# lctl set_param debug=+rpctrace
# lctl set_param jobid_var=procname_uid
# cp -a /etc /mnt/testfs
# lctl dk /tmp/debug
# grep -E "Sending RPC|Completed RPC" /tmp/debug
    :
    :
00000100:00100000:2.0:1714502851.435000:0:23892:0:(client.c:1758:ptlrpc_send_new_req())
     Sending RPC req@ffff90c9b2948640 pname:cluuid:pid:xid:nid:opc:job
     ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634353438336:0@lo:2:cp.0
00000100:00100000:2.0:1714502851.436117:0:23892:0:(client.c:2239:ptlrpc_check_set())
     Completed RPC req@ffff90c9b2948640 pname:cluuid:pid:xid:nid:opc:job
     ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634353438336:0@lo:2:cp.0

This shows that thread "ptlrpcd_01_00" (CPT 01, thread 00, pid 23892) ran on 
core 2.0 (no hyperthread) and sent an OST_SETATTR (opc = 2) RPC on behalf of 
"cp" for root (uid=0), which completed in 1117 usec (about 1.1 msec).
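
To make the field layout concrete, here is a minimal Python sketch that parses 
one "Sending RPC" and one "Completed RPC" record and computes the elapsed time 
from the two timestamps.  It assumes each record sits on a single line (the 
lines above are wrapped by the mail client), and the regular expression and 
function names are my own, based only on the sample shown here rather than on 
any documented log format.

```python
import re
from decimal import Decimal

# Header fields used here: core (3rd), timestamp (4th), then the message text.
RPC_RE = re.compile(
    r"^[0-9a-f]+:[0-9a-f]+:(?P<core>[\d.]+):(?P<ts>\d+\.\d+):\d+:\d+:\d+:"
    r"\(.*?\)\s+(?P<event>Sending|Completed) RPC\s+req@\S+\s+"
    r"pname:cluuid:pid:xid:nid:opc:job\s+(?P<fields>\S+)"
)


def parse_rpc_line(line):
    """Parse one rpctrace record into a dict, or return None if it does not match."""
    m = RPC_RE.match(line.strip())
    if m is None:
        return None
    parts = m.group("fields").split(":")
    if len(parts) < 7:
        return None
    return {
        "event": m.group("event"),
        "core": m.group("core"),
        "ts": Decimal(m.group("ts")),      # Decimal keeps microseconds exact
        "pname": parts[0],
        "pid": int(parts[2]),
        "xid": int(parts[3]),
        "nid": ":".join(parts[4:-2]),      # tolerate ':' inside the NID
        "opc": int(parts[-2]),
        "job": parts[-1],
    }


def latency_usec(sent, done):
    """Microseconds between a Sending record and its Completed record."""
    return int((done["ts"] - sent["ts"]) * 1_000_000)


PREFIX = "00000100:00100000:2.0:"
FIELDS = ("pname:cluuid:pid:xid:nid:opc:job "
          "ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:"
          "23892:1797634353438336:0@lo:2:cp.0")
SENT = (PREFIX + "1714502851.435000:0:23892:0:(client.c:1758:ptlrpc_send_new_req()) "
        "Sending RPC req@ffff90c9b2948640 " + FIELDS)
DONE = (PREFIX + "1714502851.436117:0:23892:0:(client.c:2239:ptlrpc_check_set()) "
        "Completed RPC req@ffff90c9b2948640 " + FIELDS)

if __name__ == "__main__":
    sent, done = parse_rpc_line(SENT), parse_rpc_line(DONE)
    print(sent["pname"], sent["core"], sent["opc"], sent["job"],
          latency_usec(sent, done))
    # prints: ptlrpcd_01_00 2.0 2 cp.0 1117
```

Matching Sending and Completed records by xid (rather than by position, as in 
this two-line example) would be the natural next step for a real log.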

Similarly, with a "dd" sync write workload it shows write RPCs by the ptlrpcd 
threads, and sync RPCs in the "dd" process context:
# dd if=/dev/zero of=/mnt/testfs/file bs=4k count=10000 oflag=dsync
# lctl dk /tmp/debug
# grep -E "Sending RPC|Completed RPC" /tmp/debug
    :
    :
00000100:00100000:2.0:1714503761.136971:0:23892:0:(client.c:1758:ptlrpc_send_new_req())
     Sending RPC req@ffff90c9a6ad6640 pname:cluuid:pid:xid:nid:opc:job
     ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634358961024:0@lo:4:dd.0
00000100:00100000:2.0:1714503761.140288:0:23892:0:(client.c:2239:ptlrpc_check_set())
     Completed RPC req@ffff90c9a6ad6640 pname:cluuid:pid:xid:nid:opc:job
     ptlrpcd_01_00:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23892:1797634358961024:0@lo:4:dd.0
00000100:00100000:2.0:1714503761.140518:0:17993:0:(client.c:1758:ptlrpc_send_new_req())
     Sending RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
     dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
00000100:00100000:2.0:1714503761.141556:0:17993:0:(client.c:2239:ptlrpc_check_set())
     Completed RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
     dd:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:17993:1797634358961088:0@lo:44:dd.0
00000100:00100000:2.0:1714503761.141885:0:23893:0:(client.c:1758:ptlrpc_send_new_req())
     Sending RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
     ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0
00000100:00100000:2.0:1714503761.144172:0:23893:0:(client.c:2239:ptlrpc_check_set())
     Completed RPC req@ffff90c9a6ad3040 pname:cluuid:pid:xid:nid:opc:job
     ptlrpcd_01_01:e81f3122-b1bc-4ac4-afcb-f6629a81e5bd:23893:1797634358961152:0@lo:16:dd.0

There are no stats files that aggregate information about ptlrpcd thread 
utilization.
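
In the absence of such a stats file, one rough way to see which threads are 
doing the work is to count "Sending RPC" records in the dumped debug log per 
thread name, core, and opcode.  The Python sketch below does that; it assumes 
one record per line as written by "lctl dk", and the regular expression and the 
count_sent_rpcs name are illustrative, derived only from the samples in this 
mail.  It counts RPCs sent, which is only a proxy for thread utilization.

```python
import re
from collections import Counter

SEND_RE = re.compile(
    r"^[0-9a-f]+:[0-9a-f]+:(?P<core>[\d.]+):.*?Sending RPC\s+req@\S+\s+"
    r"pname:cluuid:pid:xid:nid:opc:job\s+(?P<fields>\S+)"
)


def count_sent_rpcs(lines):
    """Count 'Sending RPC' records keyed by (thread name, core, opcode)."""
    counts = Counter()
    for line in lines:
        m = SEND_RE.match(line)
        if m is None:
            continue                      # not a Sending RPC record
        parts = m.group("fields").split(":")
        if len(parts) < 7 or not parts[-2].isdigit():
            continue                      # unexpected field layout, skip it
        counts[(parts[0], m.group("core"), int(parts[-2]))] += 1
    return counts


# The three "Sending RPC" records from the dd example, joined onto single lines.
UUID = "e81f3122-b1bc-4ac4-afcb-f6629a81e5bd"
HDR = "pname:cluuid:pid:xid:nid:opc:job "
SAMPLE = [
    "00000100:00100000:2.0:1714503761.136971:0:23892:0:"
    "(client.c:1758:ptlrpc_send_new_req()) Sending RPC req@ffff90c9a6ad6640 "
    + HDR + "ptlrpcd_01_00:" + UUID + ":23892:1797634358961024:0@lo:4:dd.0",
    "00000100:00100000:2.0:1714503761.140518:0:17993:0:"
    "(client.c:1758:ptlrpc_send_new_req()) Sending RPC req@ffff90c9a6ad3040 "
    + HDR + "dd:" + UUID + ":17993:1797634358961088:0@lo:44:dd.0",
    "00000100:00100000:2.0:1714503761.141885:0:23893:0:"
    "(client.c:1758:ptlrpc_send_new_req()) Sending RPC req@ffff90c9a6ad3040 "
    + HDR + "ptlrpcd_01_01:" + UUID + ":23893:1797634358961152:0@lo:16:dd.0",
]

if __name__ == "__main__":
    for (pname, core, opc), n in sorted(count_sent_rpcs(SAMPLE).items()):
        print(pname, core, opc, n)
```

For a real run, pass the open dump file instead of SAMPLE, for example 
count_sent_rpcs(open("/tmp/debug")).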

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud


_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
