One more addition: I also see the following message on the OSS that had the OST before the failover:

Nov 19 12:43:59 dh4-oss01 kernel: LustreError: 137-5: muse-OST0001_UUID: not available for connect from 172.23.53.214@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
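Following the hint in that LustreError, a minimal sketch of how the target placement could be verified after the failover (the hostnames and the placeholder NID are assumptions, not taken from the thread; the commands are standard 2.12-era Lustre CLI):

```shell
# On the node that should now hold the OST after failover
# (oss02 in this setup), confirm the target is actually mounted:
mount -t lustre

# List the local Lustre devices and check that muse-OST0001
# is present and in the UP state:
lctl dl | grep OST0001

# From the complaining client (172.23.53.214@o2ib4 in the log),
# verify LNet reachability of the failover node
# (<oss02-nid> is a placeholder for the real NID):
lctl ping <oss02-nid>@o2ib4
```

If `mount -t lustre` on the failover node does not show the target, the clients are most likely still being directed to a node that no longer has it, which matches the "no target" error above.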
On Fri, 19 Nov 2021 at 12:01, Meijering, Koos <[email protected]> wrote:
> Hi Colin,
>
> I've added three log files here: one from the metadata server and two
> from the object stores.
> Before these logs started the filesystem was working; then I requested
> the cluster to fail over muse-OST0001 from oss01 to oss02.
>
>
> On Thu, 18 Nov 2021 at 17:11, Colin Faber <[email protected]> wrote:
>
>> Hi Koos,
>>
>> First thing -- it's generally a bad idea to run newer server versions
>> with older clients (the opposite isn't true).
>>
>> Second -- do you have any logging that you can share from the client
>> itself? (dmesg, syslog, etc)
>>
>> A quick test may be to run 2.12.7 clients against your cluster to verify
>> that there is no interop problem.
>>
>> -cf
>>
>>
>> On Thu, Nov 18, 2021 at 8:58 AM Meijering, Koos via lustre-discuss <
>> [email protected]> wrote:
>>
>>> Hi all,
>>>
>>> We have built a Lustre server environment on CentOS 7 with Lustre
>>> 2.12.7.
>>> The clients are using 2.12.5.
>>> The setup is three clusters for a 3PB filesystem:
>>> one cluster is a two-node cluster built for the MGS and MDTs,
>>> the other two clusters are also two-node clusters used for the OSTs.
>>> The cluster framework is working as expected.
>>>
>>> The servers are connected in a multi-rail network, because some clients
>>> are on InfiniBand and the other clients are on Ethernet.
>>>
>>> But we have the following problem: when an OST fails over to the
>>> second node, the clients are unable to contact the OST that is started
>>> on the other node.
>>> The OST recovery status is "waiting for clients".
>>> When we fail it back it starts working again and the recovery status
>>> is "complete".
>>>
>>> We tried to abort the recovery but that does not work.
>>>
>>> We used these documents to build the cluster:
>>> https://wiki.lustre.org/Creating_the_Lustre_Management_Service_(MGS)
>>> https://wiki.lustre.org/Creating_the_Lustre_Metadata_Service_(MDS)
>>> https://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)
>>> https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
>>>
>>> I'm not sure what the next steps should be to find the problem and
>>> where to look.
>>>
>>> Best regards
>>> Koos Meijering
>>> ........................................................................
>>> HPC Team
>>> Rijksuniversiteit Groningen
>>> ........................................................................
>>> _______________________________________________
>>> lustre-discuss mailing list
>>> [email protected]
>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>>
>>
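For the "waiting for clients" recovery state described in the thread, a sketch of the commands that could show what recovery is waiting for and whether the multi-rail peers are visible (the target name muse-OST0001 comes from the thread; everything else is generic lctl/lnetctl usage, not a confirmed fix):

```shell
# On the OSS currently holding the target: show the recovery state,
# including how many clients have reconnected so far
lctl get_param obdfilter.muse-OST0001.recovery_status

# Abort recovery on that target if it never completes
# (the device name must match what 'lctl dl' reports)
lctl --device muse-OST0001 abort_recovery

# With multi-rail, check which local interfaces and which peer NIDs
# LNet is actually configured with on the failover node
lnetctl net show
lnetctl peer show

# On a client: check which OSTs the client can currently reach
lfs check osts
```

If the failover node's `lnetctl net show` output is missing one of the rails (for example the Ethernet side), clients on that network would be unable to reconnect during recovery even though the target itself is mounted, which would match the symptoms described.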
