Hi all,

In May I had a failure on a small cluster and asked about it here
(http://lists.lustre.org/pipermail/lustre-discuss-lustre.org/2022-May/018073.html).
Due to time constraints, I simply recreated the filesystem back then.

Now the failure has happened again. This time I have more time to investigate,
and I have not done anything destructive yet.

I use the following versions:

 *   lustre 2.14.56
 *   zfs 2.0.7 (I previously used 2.1.2, but was told that 2.1.x is not well tested with Lustre)
 *   Nodes are running Rocky Linux 8.6
 *   uname -r: 4.18.0-372.19.1.el8_6.aarch64

There are two I/O nodes (io01 and io02). Both act as MDS and OSS, and io01 is
also the MGS. Here are the devices:

[snassyr@io02 ~]$ sudo lctl dl
 0 UP osd-zfs storage-MDT0001-osd storage-MDT0001-osd_UUID 8
 1 UP mgc MGC10.31.7.61@o2ib a087e05e-d57c-4561-ad75-6827d4428f54 4
 2 UP mds MDS MDS_uuid 2
 3 UP lod storage-MDT0001-mdtlov storage-MDT0001-mdtlov_UUID 3
 4 UP mdt storage-MDT0001 storage-MDT0001_UUID 8
 5 UP mdd storage-MDD0001 storage-MDD0001_UUID 3
 6 UP osp storage-MDT0000-osp-MDT0001 storage-MDT0001-mdtlov_UUID 4
 7 UP osp storage-OST0000-osc-MDT0001 storage-MDT0001-mdtlov_UUID 4
 8 UP osp storage-OST0001-osc-MDT0001 storage-MDT0001-mdtlov_UUID 4
 9 UP lwp storage-MDT0000-lwp-MDT0001 storage-MDT0000-lwp-MDT0001_UUID 4
10 UP osd-zfs storage-OST0001-osd storage-OST0001-osd_UUID 4
11 UP ost OSS OSS_uuid 2
12 UP obdfilter storage-OST0001 storage-OST0001_UUID 6
13 UP lwp storage-MDT0000-lwp-OST0001 storage-MDT0000-lwp-OST0001_UUID 4
14 UP lwp storage-MDT0001-lwp-OST0001 storage-MDT0001-lwp-OST0001_UUID 4

[snassyr@io01 ~]$ sudo lctl dl
 0 UP osd-zfs MGS-osd MGS-osd_UUID 4
 1 UP mgs MGS MGS 6
 2 UP mgc MGC10.31.7.61@o2ib 9f351a51-0232-4306-a66d-cecee8629329 4
 3 UP osd-zfs storage-MDT0000-osd storage-MDT0000-osd_UUID 9
 4 UP mds MDS MDS_uuid 2
 5 UP lod storage-MDT0000-mdtlov storage-MDT0000-mdtlov_UUID 3
 6 UP mdt storage-MDT0000 storage-MDT0000_UUID 12
 7 UP mdd storage-MDD0000 storage-MDD0000_UUID 3
 8 UP qmt storage-QMT0000 storage-QMT0000_UUID 3
 9 UP osp storage-MDT0001-osp-MDT0000 storage-MDT0000-mdtlov_UUID 4
10 UP osp storage-OST0000-osc-MDT0000 storage-MDT0000-mdtlov_UUID 4
11 UP osp storage-OST0001-osc-MDT0000 storage-MDT0000-mdtlov_UUID 4
12 UP lwp storage-MDT0000-lwp-MDT0000 storage-MDT0000-lwp-MDT0000_UUID 4
13 UP osd-zfs storage-OST0000-osd storage-OST0000-osd_UUID 4
14 UP ost OSS OSS_uuid 2
15 UP obdfilter storage-OST0000 storage-OST0000_UUID 6
16 UP lwp storage-MDT0000-lwp-OST0000 storage-MDT0000-lwp-OST0000_UUID 4
17 UP lwp storage-MDT0001-lwp-OST0000 storage-MDT0001-lwp-OST0000_UUID 4

On io01 I see repeating errors mentioning a network error:

[65922.582578] LustreError: 20017:0:(ldlm_lib.c:3540:target_bulk_io()) Skipped 11 previous similar messages
[66494.575431] LNetError: 20017:0:(o2iblnd.c:1880:kiblnd_fmr_pool_map()) Failed to map mr 1/8 elements
[66494.575442] LNetError: 20017:0:(o2iblnd.c:1880:kiblnd_fmr_pool_map()) Skipped 11 previous similar messages
[66494.575446] LNetError: 20017:0:(o2iblnd_cb.c:613:kiblnd_fmr_map_tx()) Can't map 32768 bytes (8/8)s: -22
[66494.575448] LNetError: 20017:0:(o2iblnd_cb.c:613:kiblnd_fmr_map_tx()) Skipped 11 previous similar messages
[66494.575452] LNetError: 20017:0:(o2iblnd_cb.c:1725:kiblnd_send()) Can't setup PUT src for 10.31.7.62@o2ib: -22
[66494.575454] LNetError: 20017:0:(o2iblnd_cb.c:1725:kiblnd_send()) Skipped 11 previous similar messages
[66494.575458] LustreError: 20017:0:(events.c:477:server_bulk_callback()) event type 5, status -5, desc 00000000cdd4e797
[66494.575460] LustreError: 20017:0:(events.c:477:server_bulk_callback()) Skipped 11 previous similar messages
[66546.574314] LustreError: 20017:0:(ldlm_lib.c:3540:target_bulk_io()) @@@ network error on bulk WRITE  req@0000000070b8f1ab x1740960836990720/t0(0) o1000->[email protected]@o2ib:522/0 lens 336/33016 e 0 to 0 dl 1660376137 ref 1 fl Interpret:/0/0 rc 0/0 job:''
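
For anyone reading the log above: the trailing -22 and the status -5 look like
negated Linux errno values, which is how kernel code usually reports errors.
Here is a minimal sketch for decoding them, assuming standard Linux errno
numbering (decode_rc is just a helper name made up for this mail, not part of
any Lustre tool):

```python
import errno
import os


def decode_rc(rc: int) -> str:
    """Return 'NAME (message)' for a negated errno value such as -22."""
    code = abs(rc)
    # errno.errorcode maps the numeric value to its symbolic name.
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({os.strerror(code)})"


# The two return codes that appear in the io01 log.
for rc in (-22, -5):
    print(rc, decode_rc(rc))
```

This only names the codes (EINVAL and EIO). It says nothing about why the FMR
mapping fails in the first place.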

On io02 I see repeating errors mentioning a bad log:

[66582.856444] LustreError: 14905:0:(llog_osd.c:264:llog_osd_read_header()) storage-MDT0000-osp-MDT0001: bad log  [0x200000401:0x1:0x0] header magic: 0x0 (expected 0x10645539)
[66582.856450] LustreError: 14905:0:(llog_osd.c:264:llog_osd_read_header()) Skipped 11 previous similar messages

I can't make sense of these error messages. How can I recover?

(I have the full dmesg and lctl dk logs, but they are too big to attach. Is it
OK to upload them somewhere and put a link in a reply?)

Thank you and best regards,
Stepan


