On Dec 4, 2023, at 15:06, John Bauer 
<[email protected]<mailto:[email protected]>> wrote:

I have an OSC caching question.  I am running a dd process which writes an 
8GB file.  The file is on Lustre, striped 8x1M.  This is run on a system that 
has 2 NUMA nodes (CPU sockets).  All the data is apparently stored on one NUMA 
node (node1 in the plot below) until node1 runs out of free memory.  Then it 
appears that dd comes to a stop (no more writes complete) until Lustre dumps 
the data from node1.  Then dd continues writing, but now the data is stored 
on the second NUMA node, node0.  Why does Lustre go to the trouble of dumping 
node1 and then not use node1's memory, when there was always plenty of free 
memory on node0?
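For reference, this is roughly what I am running (the path and dd options 
shown here are illustrative rather than my exact command line):

    # create the file with 8 stripes of 1 MiB each
    lfs setstripe -c 8 -S 1M /mnt/lustre/testfile

    # write 8 GiB through the page cache
    dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=8192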

I'll forgo the explanation of the plot.  Hopefully it is clear enough.  If 
someone has questions about what the plot is depicting, please ask.

https://www.dropbox.com/scl/fi/pijgnnlb8iilkptbeekaz/dd.png?rlkey=3abonv5tx8w5w5m08bn24qb7x&dl=0

Hi John,
thanks for your detailed analysis.  It would be good to include the client 
kernel and Lustre version in this case, as the page cache behaviour can vary 
dramatically between different versions.
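For example, something like the following should capture both (on older 
clients the Lustre version may be under /proc/fs/lustre/version instead):

    uname -r
    lctl get_param version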

The allocation of the page cache pages may actually be out of the control of 
Lustre, since they are typically being allocated by the kernel VM affine to the 
core where the process that is doing the IO is running.  It may be that the 
"dd" is rescheduled to run on node0 during the IO, since the ptlrpcd threads 
will be busy processing all of the RPCs during this time, and then dd will 
start allocating pages from node0.
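One way to test that theory would be to pin dd and its allocations to a 
single node and see whether the pattern changes, e.g. (command line is 
illustrative):

    # keep dd and its memory allocations on node1
    numactl --cpunodebind=1 --membind=1 \
        dd if=/dev/zero of=/mnt/lustre/testfile bs=1M count=8192

    # then watch where the pages actually land
    numastat -m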

That said, it isn't clear why the client doesn't start flushing the dirty data 
from cache earlier.  Is it actually sending the data to the OSTs, but then 
waiting for the OSTs to reply that the data has been committed to storage 
before dropping the cache?
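The client-side dirty and grant accounting can be sampled directly to check 
this, e.g.:

    # how much dirty data the client holds, and how much grant it has
    lctl get_param osc.*.cur_dirty_bytes osc.*.cur_grant_bytes osc.*.max_dirty_mb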

It would be interesting to plot the osc.*.rpc_stats::write_rpcs_in_flight and 
::pending_write_pages to see if the data is already in flight.  The 
osd-ldiskfs.*.brw_stats on the server would also be useful to graph over the 
same period, if possible.
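A simple sampling loop while the dd runs would be enough, something like this 
(the exact field names in rpc_stats may differ slightly between versions):

    # sample in-flight write RPCs and pending write pages once per second
    while sleep 1; do
        date +%T
        lctl get_param osc.*.rpc_stats |
            grep -E 'write RPCs in flight|pending write pages'
    done | tee rpc_stats.log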

It *does* look like the "node1 dirty" is kept at a low value for the entire 
run, so it at least appears that RPCs are being sent, but there is no page 
reclaim triggered until memory is getting low.  Doing page reclaim is really 
the kernel's job, but it seems possible that the Lustre client may not be 
suitably notifying the kernel about the dirty pages and kicking it in the butt 
earlier to clean up the pages.
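The kernel's per-node view of this could be watched during the run to 
confirm, e.g.:

    # per-NUMA-node free, dirty, and writeback counters
    grep -E 'MemFree|Dirty|Writeback' /sys/devices/system/node/node*/meminfo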

PS: my preference would be to just attach the image to the email instead of 
hosting it externally, since it is only 55 KB.  Is this blocked by the list 
server?

Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
