Hello,
Coming back to this, I have proceeded with the one-file approach.
I am using a toy cluster with one combined MGS/MDT, four OSTs, and four clients,
each client handling a different section of the file in parallel.
The clients run containerized in the same VMs as the OSTs.
The file is striped across all four OSTs with a stripe size of 1MB
(unless mentioned otherwise).
I am using different file sizes to measure performance, ranging from
~50MB to ~2.5GB, and I am measuring end-to-end times for reading/writing the
file.
I have performed the following experiments:
A) Using buffers of varying size (1MB, 2MB, 4MB) for the read/write calls.
B) To test whether stripe alignment is beneficial, I aligned the
read/write calls so that each call touches exactly one stripe. If I understand
correctly, this means that each call has the form `pwrite(fd, buffer,
size, offset)` (same for pread), where offset is a multiple of
stripe_size and size = stripe_size (buffer size = stripe_size). For this
experiment, stripe_size = buffer size = 1MB (see the sketch after this list).
C) Without taking care of stripe alignment, and with a 1MB buffer, I tried to
determine whether the stripe size matters by experimenting with
stripe_size = 65536, 655360, 6553600, and 1MB.
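For reference, here is roughly how I issue the stripe-aligned calls of
experiment B (a simplified sketch, not my exact code; my_rank and
section_size are placeholders, and error handling is omitted):

    #include <stdlib.h>
    #include <unistd.h>

    #define STRIPE_SIZE (1 << 20)   /* 1MB, matches the file layout */

    /* Each client handles one contiguous section of the file. section_size is
     * a multiple of STRIPE_SIZE, so every call below starts on a stripe
     * boundary and covers exactly one stripe. */
    static void read_section(int fd, int my_rank, off_t section_size)
    {
        char  *buf   = malloc(STRIPE_SIZE);
        off_t  start = (off_t)my_rank * section_size;

        for (off_t off = start; off < start + section_size; off += STRIPE_SIZE)
            pread(fd, buf, STRIPE_SIZE, off);   /* pwrite() is used the same way */

        free(buf);
    }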
For a given file size, the results are almost identical for both read
and write across all my experiments.
My questions are:
Q1) Is the way I am trying to align calls with stripes (so that, in
effect, each call only needs one OST) correct?
Q2) If it is indeed correct, is it expected that I see no difference
when aligning calls with stripes versus when I am not? Based on our
discussion and the best practices I found online, I would expect better
performance when alignment is taken into account.
Q3) Is it expected that I see no difference in performance across
varying stripe sizes (with the size of the read/write operations fixed
at 1MB)?
Q4) Is it expected that I see no difference in performance across
varying read/write sizes (with stripe_size fixed at 1MB)?
Q5) If the parameters mentioned should indeed affect performance, any
idea why no difference is observed in my setup? E.g. I was thinking the
MGS/MDT node could be slow and thus a bottleneck, or the files might be
too small to show any significant difference, etc.
Are there any additional things I might be missing that would help me
better understand what is going on?
Thanks again for the help,
Apostolis
On 12/10/24 23:30, Andreas Dilger wrote:
On Sep 30, 2024, at 13:26, Apostolis Stamatis <[email protected]>
wrote:
Thank you very much Andreas.
Your explanation was very insightful.
I do have the following questions/thoughts:
Let's say I have 2 available OSTs and 4MB of data, and the stripe size
is 1MB. (The sizes are small for discussion purposes; I am trying to
understand which solution, if any, would perform better in general.)
I would like to compare the following two strategies of
writing/reading the data:
A) I can store all the data in a single big Lustre file, striped
across the 2 OSTs.
B) I can create (e.g.) 4 smaller Lustre files, each consisting of
1MB of data. Suppose I place them manually on the OSTs in the same way
that the data would be striped in strategy A.
So the only difference between the two strategies is whether the data is
in a single Lustre file or not (meaning I make sure each OST has a
similar load in both cases).
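(For concreteness, the striped file of strategy A could be created with
something like the following; the path and values here are only
illustrative:)

    #include <fcntl.h>
    #include <lustre/lustreapi.h>

    /* Illustrative sketch: create the strategy-A file striped across 2 OSTs
     * with a 1MB stripe size. stripe_offset = -1 lets Lustre choose the
     * starting OST; stripe_pattern = 0 selects the default (RAID0) layout.
     * Returns a file descriptor on success or a negative errno on failure. */
    int create_strategy_a_file(const char *path)
    {
        return llapi_file_open(path, O_CREAT | O_RDWR, 0644,
                               1048576 /* stripe_size */,
                               -1      /* stripe_offset */,
                               2       /* stripe_count */,
                               0       /* stripe_pattern */);
    }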
Then:
Q1. Suppose I have 4 simultaneous processes, each wanting to read 1MB
of data. On strategy A, each process opens the file (via
llapi_file_open) and then reads the corresponding data by calculating
the offset from the start. On strategy B each process simply opens
the corresponding file and reads its data. Would there be any
difference in performance between the two strategies?
For reading it is unlikely that there would be a significant
difference in performance. For writing, option A would be somewhat
slower than B for large amounts of data, because there would be some
lock contention between parallel writers to the same file.
However, if this behavior is expanded to a large scale, then having
millions or billions of 1MB files would have a different kind of
overhead to open/close each file separately and having to manage so
many files vs. having fewer, larger files. Given that a single
client can read/write GB/s, it makes sense to aggregate enough data
per file to amortize the overhead of the lookup/open/stat/close.
Large-scale HPC applications try to pick a middle ground, for example
having 1 file per checkpoint timestep written in parallel (instead of
1M separate per-CPU files), but each timestep (hourly) has a different
file. Alternately, each timestep could write individual files into a
separate directory, if they are reasonably large (e.g. GB).
Q2. Suppose I have 1 process, wanting to read the (e.g.) 3rd MB of
data. Would strategy B be better, since it avoids the overhead of
"skipping" to the offset that is required in strategy A ?
Seeking the offset pointer within a file has no cost. That is just
changing a number in the open file descriptor on the client, so it
doesn't involve the servers or any kind of locking.
Q3. For question 2, would the answer be different if the read is not
aligned to the stripe-size? Meaning that in both strategies I would
have to skip to an offset (compared to Q2 where I could just read the
whole file in strategy B from the start), but in strategy A the skip
is bigger.
Same answer as 2 - the seeking itself has no cost. The *read* of
unaligned data in this case is likely to be somewhat slower than
reading aligned data (it may send RPCs to two OSTs, needing two
separate locks, etc). However, with any large-sized read (e.g. 8 MB+)
it is unlikely to make a significant difference.
Q4. One concern I have regarding strategy A is that all the stripes
of the file that are in the same OST are seen -internally- as one
object (as per "Understanding Lustre Internals"). Does this affect
performance when different, but not overlapping, parts of the file
(that are on the same OST) are being accessed (for example due to
locking)? Does it matter if the parts being accessed are in different
"chunks", e.g. the 1st and 3rd MB in the above example?
No, Lustre can allow concurrent read access to a single object from
multiple threads/clients. When writing the file, there can also be
concurrent write access to a single object, but only with
non-overlapping regions. That would also be true if writing to
separate files in option B (contention if two processes tried to write
the same small file).
Also if there are any additional docs I can read on those topics
(apart from "Understanding Lustre internals") to get a better
understanding, please do point them out.
Patrick Farrell has presented at LAD and LUG a few times about
optimizations to the IO pipeline, which may be interesting:
https://wiki.lustre.org/Lustre_User_Group_2022
- https://wiki.lustre.org/images/a/a3/LUG2022-Future_IO_Path-Farrell.pdf
https://www.eofs.eu/index.php/events/lad-23/
- https://www.eofs.eu/wp-content/uploads/2024/02/04-LAD-2023-Unaligned-DIO.pdf
https://wiki.lustre.org/Lustre_User_Group_2024
- https://wiki.lustre.org/images/a/a0/LUG2024-Hybrid_IO_Path_Update-Farrell.pdf
Thanks again for your help,
Apostolis
On 9/23/24 00:42, Andreas Dilger wrote:
On Sep 18, 2024, at 10:47, Apostolis Stamatis <[email protected]>
wrote:
I am trying to read/write a specific stripe for files striped
across multiple OSTs. I've been looking around the C API but with
no success so far.
Let's say I have a big file which is striped across multiple OSTs.
I have a cluster of compute nodes which perform some computation on
the data of the file. Each node needs only a subset of that data.
I want each node to be able to read/write only the needed
information, so that all reads/writes can happen in parallel. The
desired data may or may not be aligned with the stripes (this is
secondary).
It is my understanding that stripes are just parts of the file,
meaning that if I have an array of 100 rows and stripe A contains
the first half, then it would contain the first 50 rows. Is this
correct?
This is not totally correct. The location of the data depends on
the size of the data and the stripe size.
For a 1-stripe file (the default unless otherwise specified), all of
the data would be in a single object, regardless of the size of the
data.
For a 2-stripe file with stripe_size=1MiB, the first MB of data
[0-1MB) is on object 0, the second MB of data [1-2MB) is on object
1, the third MB of data [2-3MB) is back on object 0, and so on.
See
https://wiki.lustre.org/Understanding_Lustre_Internals#Lustre_File_Layouts for
example.
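Expressed as arithmetic, the round-robin mapping from a byte offset in
the file to the object that holds it would look like this (an
illustrative sketch, not an actual Lustre interface):

    #include <stdint.h>

    /* Illustrative only: which stripe object of a striped file stores the
     * byte at `offset`, given the file's stripe_size and stripe_count. */
    static unsigned int object_for_offset(uint64_t offset, uint64_t stripe_size,
                                          unsigned int stripe_count)
    {
        uint64_t stripe_no = offset / stripe_size;   /* which stripe of the file */
        return stripe_no % stripe_count;             /* which object holds it    */
    }

For the 2-stripe example above, object_for_offset(2 * 1048576, 1048576, 2)
returns 0, i.e. the third MB is back on object 0.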
To sum up my questions are:
1) Can I read/write a specific stripe of a file via the C API to
achieve better performance/locality?
There is no Lustre llapi_* interface that provides this
functionality, but you can of course read the file with regular
read() or preferably pread() or readv() calls with the right file
offsets.
2) Is it correct that stripes include parts of the file, meaning
the raw data? If not, can the raw data be extracted from any
additional information stored in the stripe?
For example, if you have a 4-stripe file, then the application
should read every 4th MB of the file to stay on the same OST object.
Note that the *OST* index is not necessarily the same as the
*stripe* number of the file. To read the file from the local OST,
the application should check the local OST index, select the matching
stripe from the file layout, and from that determine the offset from
the start of the file = stripe_size * stripe_number.
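A rough sketch of that lookup, based on the llapi_file_get_stripe()
interface described in the Lustre manual (struct field names can vary
between Lustre versions; error handling is minimal):

    #include <stdlib.h>
    #include <lustre/lustreapi.h>   /* llapi_file_get_stripe(), struct lov_user_md */

    /* Return the stripe number of `path` stored on OST `local_ost`, or -1 if
     * none; also report the file's stripe size and count so the caller can
     * read offsets stripe_no * stripe_size, advancing by
     * stripe_count * stripe_size each time, to stay on that OST. */
    static int stripe_on_ost(const char *path, unsigned int local_ost,
                             unsigned long long *stripe_size,
                             unsigned int *stripe_count)
    {
        struct lov_user_md *lum;
        int stripe_no = -1;

        lum = malloc(sizeof(*lum) +
                     LOV_MAX_STRIPE_COUNT * sizeof(struct lov_user_ost_data));
        if (lum != NULL && llapi_file_get_stripe(path, lum) == 0) {
            *stripe_size  = lum->lmm_stripe_size;
            *stripe_count = lum->lmm_stripe_count;
            for (unsigned int i = 0; i < lum->lmm_stripe_count; i++)
                if (lum->lmm_objects[i].l_ost_idx == local_ost)
                    stripe_no = i;
        }
        free(lum);
        return stripe_no;
    }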
However, you could also do this more easily by having a bunch of
1-stripe files and doing the reads directly on the local OSTs. You
would run "lfs find DIR -i LOCAL_OST_IDX" to get a list of the files
on each OST, and then process them directly.
3) If each compute node is run on top of a different OST where
stripes of the file are stored, would it be better in terms of
performance to have the node read the stripe of its OST? (because
e.g. it avoids data transfer over the network)
This is not necessarily needed, if you have a good network, but it
depends on the workload. Local PCI storage access is about the same
speed as remote PCI network access because they are limited by the
PCI bus bandwidth. You would notice a difference if you have a
large number of clients that are completely IO-bound and overwhelm
the storage.
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
Cheers, Andreas
--
Andreas Dilger
Lustre Principal Architect
Whamcloud
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org