The difference between your Intel and AMD nodes may be the RPC checksum type 
that is used by default (the clients and servers negotiate the fastest 
algorithm).

I suspect the checksum error itself has already been fixed, but in the 
meantime you could try setting a checksum type other than t10ip4k (or 
whichever one you are using; compare the output of 
"lctl get_param osc.*.checksum_type" on your Intel and AMD clients).
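
As a rough sketch of that comparison, something like the following could be 
run on one Intel and one AMD client. It assumes the active algorithm is the 
one shown in [brackets] in the get_param output; the helper name 
active_checksum and the choice of crc32c as the alternative are only 
illustrative, so check which types your clients actually list first.

```shell
#!/bin/sh
# Hypothetical helper: read "lctl get_param osc.*.checksum_type" output on
# stdin and print the name shown in [brackets] on each line (assumed to be
# the currently active checksum type).
active_checksum() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Only touch lctl if it is installed on this node.
if command -v lctl >/dev/null 2>&1; then
    # Count how many OSC devices use each active checksum type.
    lctl get_param 'osc.*.checksum_type' | active_checksum | sort | uniq -c

    # To try a different algorithm, run as root (crc32c is just an example,
    # and the setting is not kept across a remount):
    # lctl set_param 'osc.*.checksum_type=crc32c'
fi
```

If the Intel clients report a different active type than the AMD clients, 
that would support the idea that the negotiated checksum is what differs.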

Cheers, Andreas

On Jun 3, 2024, at 08:21, Fokke Dijkstra via lustre-discuss 
<[email protected]> wrote:

Dear all,

We are frequently (about daily) seeing the following type of error in our 
logfile on some specific client nodes:

Jun  1 11:03:17 a100gpu1 kernel: LustreError: 
3834:0:(integrity.c:66:obd_page_dif_generate_buffer()) 
scratch-OST0042-osc-ff35febc655a9000: unexpected used guard number of DIF 5/5, 
data length 4096, sector size 512: rc = -7
Jun  1 11:03:17 a100gpu1 kernel: LustreError: 
3834:0:(osc_request.c:2750:osc_build_rpc()) prep_req failed: -7
Jun  1 11:03:17 a100gpu1 kernel: LustreError: 
3834:0:(osc_cache.c:2186:osc_check_rpcs()) Write request failed with -7

We are running Lustre 2.15.4 over Ethernet on Rocky 8 servers and clients.
The error only appears on the clients; nothing is logged on the servers around 
that time.

The errors mostly appear on our Intel Ice Lake based GPU nodes and less 
frequently on Intel Ice Lake based CPU nodes. We do not see the errors on our 
AMD Zen 3 nodes (the latter form the majority of our cluster).

The problem was brought to our attention by a few users who were running 
PyTorch code on the GPU nodes and complained about PyTorch reporting an error 
while writing a file and then failing.
When checking the log files, the error appears to occur more often than that, 
and I can't find a clear correlation with either specific job types or job 
failures (some jobs seem to continue to run after the error appears in the 
system log file).

Has anyone seen this error before? Does somebody know how to fix this?

Kind regards,

Fokke Dijkstra

--
Fokke Dijkstra <[email protected]>
Team High Performance Computing
Center for Information Technology, University of Groningen
Postbus 11044, 9700 CA  Groningen, The Netherlands

_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org

--
Andreas Dilger
Lustre Principal Architect
Whamcloud






