Dear Lustre community,

We have an FSx for Lustre configuration at AWS (FSx Persistent SSD, 1.2 TB, 
250 MB/s/TiB; FSx for Lustre server version 2.12). The file system is mounted 
from an EC2 instance running Ubuntu, using the Lustre client modules. Which 
Lustre client version can be installed depends on the Linux kernel.

On the file system we have compressed data tables (80,000 rows x 1,440 columns, 
and the transposed tables as well). These are stored using the fst 
library/package for R (https://www.fstpackage.org/). The data are stored 
columnwise in a serialised manner. The benefit is that one can read a set of 
columns (or rows) without having to read the whole file (similar to Parquet 
files). It is one of the fastest ways to read/write data from R.

Recently we discovered that on a newly configured EC2 reading data from these 
files on the Lustre file system is a lot slower than on an older EC2.
After some debugging, we found that an EC2 with Ubuntu 18.04.6 LTS and 
kernel 5.4.0-1083-aws using Lustre Client 2.10.8 shows the expected fast 
performance (the same as the older EC2). However, upgrading the Lustre 
Client to 2.12.8 (nothing else changed, same machine) results in poor 
performance. The test job reads 4 columns from 180 files, each containing a 
data table as described above. This takes about 5 seconds in the fast case, 
but about 20 seconds in the slow case.

In addition, just reading one whole table (one file) using read_fst takes 
about:

- 1-2 seconds with Lustre Client 2.10.8
- 20-22 seconds with Lustre Client 2.12.8
- 1-2 seconds with Lustre Client 2.15.4 (on Ubuntu 22, another EC2)

Reading the files immediately again using the Lustre Client 2.12.8 improves the 
performance back to 1 - 2 seconds. So, when they are cached (somewhere), the 
performance is OK, but a cold read is very slow. In contrast, the other two 
(2.10.8 and 2.15.4) are already very fast reading the files the first time.
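
As a minimal sketch of how the first-read versus repeated-read difference could 
be measured outside R: this is not the R job behind the numbers above, and the 
function names time_read and cold_and_warm are illustrative assumptions. It 
reads a file sequentially twice and reports both timings. The first read is 
only a cold read if the file is not already in the client cache (for example 
on a freshly mounted client).

```python
import sys
import time
from pathlib import Path


def time_read(path, chunk_size=1 << 20):
    """Read the whole file sequentially; return (seconds, bytes_read)."""
    start = time.perf_counter()
    total = 0
    # buffering=0 avoids Python's own buffer; OS and Lustre client caching
    # still apply, which is exactly the effect being observed.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return time.perf_counter() - start, total


def cold_and_warm(path):
    """Time two back-to-back reads of the same file."""
    first, size = time_read(path)
    second, _ = time_read(path)
    return {"file": str(path), "bytes": size,
            "first_s": first, "second_s": second}


if __name__ == "__main__":
    # Usage: python read_timing.py /path/on/lustre/file1 [file2 ...]
    for name in sys.argv[1:]:
        print(cold_and_warm(Path(name)))
```

Running the same script against the same files with each client version would 
show whether the slowdown also appears for plain sequential reads, or only for 
the access pattern of read_fst.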

I would prefer to use 2.15.4, which is the latest supported version on the 
highest supported Ubuntu version (22), but unfortunately 2.15.4 performs like 
2.12.8 in the first test (reading 4 columns from 180 files takes about 20 
seconds instead of around 5). As a result, we're stuck with Ubuntu 18 and 
kernel 5.4.0, which is the latest combination that still supports Lustre 
Client 2.10.8 (which is fast in all cases).

The test has been repeated a lot of times at different times to rule out 
caching behaviour.

What could be the reason for these large performance differences (4x to 10x 
slower)? Are there perhaps some parameter settings different between the Lustre 
Client versions? Can those be adjusted?
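
One way to look for differing settings would be to save the output of 
"lctl get_param" on each client (for example for llite.*.max_read_ahead_mb and 
osc.*.max_rpcs_in_flight) and compare the two dumps. The sketch below assumes 
the usual name=value output format. The helper names and the sample values in 
the usage comment are hypothetical, not measured on our systems. Because the 
second component of a parameter name is specific to the mount instance, it is 
replaced with "*" before comparing.

```python
def normalize(name):
    """Replace the instance-specific second component with '*'."""
    parts = name.split(".")
    if len(parts) >= 3:
        parts[1] = "*"
    return ".".join(parts)


def parse_params(text):
    """Parse 'name=value' lines into {normalized_name: sorted tuple of values}."""
    collected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        collected.setdefault(normalize(name.strip()), set()).add(value.strip())
    return {name: tuple(sorted(values)) for name, values in collected.items()}


def diff_params(old, new):
    """Return {name: (old_values, new_values)} for every setting that differs."""
    changes = {}
    for name in sorted(set(old) | set(new)):
        a, b = old.get(name), new.get(name)
        if a != b:
            changes[name] = (a, b)  # None means the parameter is absent
    return changes


# Usage with made-up values (not real defaults of any client version):
# old = parse_params("llite.fs-aaa.max_read_ahead_mb=40\n")
# new = parse_params("llite.fs-bbb.max_read_ahead_mb=64\n")
# diff_params(old, new)
```

Parameters that exist in only one of the two dumps are reported with None on 
the other side, which would also reveal settings added or removed between 
client versions.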

I also posted this question on the AWS re:Post “forum”: 
https://repost.aws/questions/QUCiF-XpFaS0al162IYKXg1w/fsx-for-lustre-client-2-12-very-slow-compared-to-2-10

Kind regards,

Anton Wijbenga

+31 6 14 86 86 67
Van Deventerlaan 20, 3528 AE, Utrecht (https://maps.app.goo.gl/uLNCBe5z6FQVMLfK6)
https://www.linkedin.com/in/anton-wijbenga-b9798623/
https://www.maptm.nl
[email protected]
Not available on Wednesdays


_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
