Hi Colin, I've attached three log files here: one from the metadata server and two from the object storage servers. Before these logs start, the filesystem was working; then I requested the cluster to fail over muse-OST0001 from oss01 to oss02.
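For reference, the failover was requested through Pacemaker. Assuming the pcs frontend is in use, the request would look roughly like this (resource and node names as they appear in the Pacemaker logs below):

  # move the muse01 OST resource from dh4-oss01 to dh4-oss02
  pcs resource move muse01 dh4-oss02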
On Thu, 18 Nov 2021 at 17:11, Colin Faber <[email protected]> wrote:
> Hi Koos,
>
> First thing -- it's generally a bad idea to run newer server versions with
> older clients (the opposite isn't true).
>
> Second -- do you have any logging that you can share from the client
> itself? (dmesg, syslog, etc.)
>
> A quick test may be to run 2.12.7 clients against your cluster to verify
> that there is no interop problem.
>
> -cf
>
> On Thu, Nov 18, 2021 at 8:58 AM Meijering, Koos via lustre-discuss
> <[email protected]> wrote:
>> Hi all,
>>
>> We have built a Lustre server environment on CentOS 7 with Lustre 2.12.7.
>> The clients are running 2.12.5.
>> The setup is three clusters for a 3 PB filesystem:
>> one two-node cluster built for the MGS and MDTs,
>> and two more two-node clusters used for the OSTs.
>> The cluster framework is working as expected.
>>
>> The servers are connected in a multi-rail network, because some clients
>> are on InfiniBand and the other clients are on Ethernet.
>>
>> But we have the following problem: when an OST fails over to the
>> second node, the clients are unable to contact the OST that is started on
>> the other node. The OST recovery status is "waiting for clients".
>> When we fail it back, it starts working again and the recovery status is
>> "complete".
>>
>> We tried to abort the recovery, but that does not work.
>>
>> We used these documents to build the cluster:
>> https://wiki.lustre.org/Creating_the_Lustre_Management_Service_(MGS)
>> https://wiki.lustre.org/Creating_the_Lustre_Metadata_Service_(MDS)
>> https://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)
>> https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
>>
>> I'm not sure what the next steps should be to find the problem and where
>> to look.
>>
>> Best regards
>> Koos Meijering
>> ........................................................................
>> HPC Team
>> Rijksuniversiteit Groningen
>> ........................................................................
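To Colin's question about client-side logging: on a 2.12 client, the relevant state can be collected with something like the following (the osc parameter name below is assumed from the target name and may differ slightly per client instance):

  # kernel messages touching Lustre/LNet
  dmesg | grep -iE 'lustre|lnet'
  # import/connection state history for the affected OST
  lctl get_param osc.muse-OST0001-osc-*.state
  # dump the Lustre kernel debug buffer to a file
  lctl dk /tmp/lustre-debug.log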
Nov 19 11:53:30 dh4-oss01 stonith-ng[4910]: notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss01 stonith-ng[4910]: notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss01 Lustre(muse01)[220826]: INFO: Starting to unmount /dev/mapper/muse01
Nov 19 11:53:30 dh4-oss01 kernel: Lustre: Failing over muse-OST0001
Nov 19 11:53:31 dh4-oss01 kernel: Lustre: server umount muse-OST0001 complete
Nov 19 11:53:31 dh4-oss01 Lustre(muse01)[220826]: INFO: /dev/mapper/muse01 unmounted successfully
Nov 19 11:53:31 dh4-oss01 crmd[4914]: notice: Result of stop operation for muse01 on dh4-oss01: 0 (ok)
Nov 19 11:53:54 dh4-oss01 kernel: LustreError: 137-5: muse-OST0001_UUID: not available for connect from 172.23.53.214@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
Nov 19 11:53:54 dh4-oss01 kernel: LustreError: Skipped 83 previous similar messages
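The repeated 137-5 "no target" messages above suggest clients keep retrying oss01 after the failover. One thing worth confirming is that the target actually has the second node's NID registered on disk, e.g. (run while the device is unmounted; --dryrun only prints and changes nothing):

  # show the on-disk target parameters, including failover.node / servicenode NIDs
  tunefs.lustre --dryrun /dev/mapper/muse01

If only oss01's NID is listed there, clients have no failover NID to try.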
Nov 19 11:53:30 dh4-oss02 crmd[4901]: notice: State transition S_IDLE -> S_POLICY_ENGINE
Nov 19 11:53:30 dh4-oss02 stonith-ng[4897]: notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss02 stonith-ng[4897]: notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss02 pengine[4900]: notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss02 pengine[4900]: notice: Calculated transition 273, saving inputs in /var/lib/pacemaker/pengine/pe-input-152.bz2
Nov 19 11:53:30 dh4-oss02 pengine[4900]: notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss02 pengine[4900]: notice: * Move muse01 ( dh4-oss01 -> dh4-oss02 )
Nov 19 11:53:30 dh4-oss02 pengine[4900]: notice: Calculated transition 274, saving inputs in /var/lib/pacemaker/pengine/pe-input-153.bz2
Nov 19 11:53:30 dh4-oss02 crmd[4901]: notice: Initiating stop operation muse01_stop_0 on dh4-oss01
Nov 19 11:53:31 dh4-oss02 crmd[4901]: notice: Initiating start operation muse01_start_0 locally on dh4-oss02
Nov 19 11:53:31 dh4-oss02 Lustre(muse01)[142345]: INFO: Starting to mount /dev/mapper/muse01
Nov 19 11:53:31 dh4-oss02 kernel: LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
Nov 19 11:53:32 dh4-oss02 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Nov 19 11:53:32 dh4-oss02 kernel: Lustre: muse-OST0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Nov 19 11:53:32 dh4-oss02 kernel: Lustre: muse-OST0001: in recovery but waiting for the first client to connect
Nov 19 11:53:32 dh4-oss02 kernel: Lustre: Skipped 1 previous similar message
Nov 19 11:53:32 dh4-oss02 Lustre(muse01)[142345]: INFO: /dev/mapper/muse01 mounted successfully
Nov 19 11:53:32 dh4-oss02 crmd[4901]: notice: Result of start operation for muse01 on dh4-oss02: 0 (ok)
Nov 19 11:53:32 dh4-oss02 crmd[4901]: notice: Initiating monitor operation muse01_monitor_20000 locally on dh4-oss02
Nov 19 11:53:32 dh4-oss02 crmd[4901]: notice: Transition 274 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-153.bz2): Complete
Nov 19 11:53:32 dh4-oss02 crmd[4901]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE
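The "in recovery but waiting for the first client to connect" line matches the recovery status we reported. On the OSS where the target is now mounted, that status can be watched with:

  # recovery state of the OST on its current node
  lctl get_param obdfilter.muse-OST0001.recovery_status

In the failing case it keeps reporting that it is waiting for clients; only after failing back does it complete.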
Nov 19 11:53:32 dh4-mds02 kernel: LustreError: 11-0: muse-OST0001-osc-MDT0000: operation ost_statfs to node 172.23.53.175@o2ib4 failed: rc = -107
Nov 19 11:53:32 dh4-mds02 kernel: Lustre: muse-OST0001-osc-MDT0000: Connection to muse-OST0001 (at 172.23.53.175@o2ib4) was lost; in progress operations using this service will wait for recovery to complete
Nov 19 11:53:32 dh4-mds02 kernel: Lustre: muse-MDT0000: Connection restored to a7fa3ae3-f879-926d-aeef-f3c62d62dd7e (at 172.23.53.176@o2ib4)
Nov 19 11:53:32 dh4-mds02 kernel: Lustre: Skipped 1 previous similar message
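The MDS log shows its connection being restored to 172.23.53.176@o2ib4 (presumably oss02's NID), so server-to-server traffic recovers; the open question is whether the clients can reach that NID. A quick check from a client would be:

  # verify LNet reachability of the failover node from a client
  lctl ping 172.23.53.176@o2ib4

An IB client would ping the o2ib4 NID as above; the Ethernet clients in the multi-rail setup would need to ping oss02's corresponding tcp NID instead.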
