Hi Koos,

One thing you mentioned that I should have picked up on sooner was: "The servers are connected in a multirail network, because some clients are on IB and the other clients are on Ethernet."
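In the meantime, it may help to capture the LNet and recovery state on both OSS nodes and on an affected client while the failover is active. The commands below are a sketch; the NID and target names are examples and need to be adjusted to your setup:

```shell
# On each OSS (and on an affected client): dump the multirail LNet view.
lnetctl net show     # local NIDs per network (o2ib*, tcp*)
lnetctl peer show    # peers and the NIDs known for each peer

# On the OSS that took over the target: watch the recovery state.
lctl get_param obdfilter.*.recovery_status

# On a client that cannot reconnect: check the import state of the
# OSC for the affected OST (look for the muse-OST0001 entry).
lctl get_param osc.*.import

# Verify raw LNet reachability to the failover OSS on each fabric.
# 10.0.0.2@tcp is an example NID; use the failover node's real NIDs.
lctl ping 10.0.0.2@tcp
```

If `lctl ping` to the failover node only succeeds on one of the two fabrics, that would point at the multirail configuration rather than at recovery itself.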
Can you describe your topology? How are the various elements connected to each other?

-cf

On Fri, Nov 19, 2021 at 5:38 AM Meijering, Koos <[email protected]> wrote:

> One more addition: I also saw the following message on the OSS that had the OST before the failover:
>
> Nov 19 12:43:59 dh4-oss01 kernel: LustreError: 137-5: muse-OST0001_UUID: not available for connect from 172.23.53.214@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
>
> On Fri, 19 Nov 2021 at 12:01, Meijering, Koos <[email protected]> wrote:
>
>> Hi Colin,
>>
>> I've added three log files here: one from the metadata server and two from the object stores.
>> Before these logs start, the filesystem was working; then I requested the cluster to fail over muse-OST0001 from oss01 to oss02.
>>
>> On Thu, 18 Nov 2021 at 17:11, Colin Faber <[email protected]> wrote:
>>
>>> Hi Koos,
>>>
>>> First thing -- it's generally a bad idea to run newer server versions with older clients (the opposite isn't true).
>>>
>>> Second -- do you have any logging that you can share from the client itself? (dmesg, syslog, etc.)
>>>
>>> A quick test may be to run 2.12.7 clients against your cluster to verify that there is no interop problem.
>>>
>>> -cf
>>>
>>> On Thu, Nov 18, 2021 at 8:58 AM Meijering, Koos via lustre-discuss <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> We have built a Lustre cluster server environment on CentOS 7 with Lustre 2.12.7.
>>>> The clients are using 2.12.5.
>>>> The setup is three clusters for a 3 PB filesystem:
>>>> one cluster is a two-node cluster built for the MGS and MDTs;
>>>> the other two clusters are also two-node clusters, used for the OSTs.
>>>> The cluster framework is working as expected.
>>>>
>>>> The servers are connected in a multirail network, because some clients are on IB and the other clients are on Ethernet.
>>>>
>>>> But we have the following problem:
>>>> When an OST fails over to the second node, the clients are unable to contact the OST that is started on the other node.
>>>> The OST recovery status is "waiting for clients".
>>>> When we fail it back, it starts working again and the recovery status is "complete".
>>>>
>>>> We tried to abort the recovery, but that does not work.
>>>>
>>>> We used these documents to build the cluster:
>>>> https://wiki.lustre.org/Creating_the_Lustre_Management_Service_(MGS)
>>>> https://wiki.lustre.org/Creating_the_Lustre_Metadata_Service_(MDS)
>>>> https://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)
>>>> https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
>>>>
>>>> I'm not sure what the next steps must be to find the problem and where to look.
>>>>
>>>> Best regards
>>>> Koos Meijering
>>>> ........................................................................
>>>> HPC Team
>>>> Rijksuniversiteit Groningen
>>>> ........................................................................
>>>> _______________________________________________
>>>> lustre-discuss mailing list
>>>> [email protected]
>>>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
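One more thing worth double-checking in a mixed IB/Ethernet setup like the one described above: whether the failover NIDs for both fabrics were registered on each target when it was formatted. A missing service NID for one network would produce exactly this symptom (clients on that fabric cannot reach the OST after failover). This can be inspected read-only; the device path below is an example:

```shell
# On the OSS that hosts the target, read the on-disk configuration
# without changing anything (--dryrun only prints it):
tunefs.lustre --dryrun /dev/mapper/ost0001   # example device path

# In the output, look at the failover/service NIDs, e.g.:
#   Parameters: failover.node=172.23.x.x@o2ib4,10.x.x.x@tcp
# Every fabric that clients use should list a NID for both HA peers.
```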
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
