Hi Colin,

I've attached three log files: one from the metadata server and two from the
object storage servers. Before these logs start the filesystem was working;
I then asked the cluster to fail muse-OST0001 over from oss01 to oss02.
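For reference, the failover was triggered roughly like this (this is a sketch: the resource name muse01 and node names are taken from the logs below, and the exact command depends on your Pacemaker frontend):

```shell
# Ask Pacemaker to move the Lustre OST resource to the second OSS node
# (resource/node names as they appear in the Pacemaker logs below).
pcs resource move muse01 dh4-oss02

# On the node now serving the target, watch the recovery state that the
# clients are supposed to complete:
lctl get_param obdfilter.muse-OST0001.recovery_status
```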


On Thu, 18 Nov 2021 at 17:11, Colin Faber <[email protected]> wrote:

> Hi Koos,
>
> First thing -- it's generally a bad idea to run newer server versions with
> older clients (the opposite isn't true).
>
> Second -- do you have any logging that you can share from the client
> itself? (dmesg, syslog, etc)
>
> A quick test may be to run 2.12.7 clients against your cluster to verify
> that there is no interop problem.
>
> -cf
>
>
> On Thu, Nov 18, 2021 at 8:58 AM Meijering, Koos via lustre-discuss <
> [email protected]> wrote:
>
>> Hi all,
>>
>> We have built a Lustre server environment on CentOS 7 with Lustre
>> 2.12.7.
>> The clients are running 2.12.5.
>> The setup is three clusters for a 3 PB filesystem:
>> one two-node cluster for the MGS and MDTs, and
>> two more two-node clusters for the OSTs.
>> The cluster framework is working as expected.
>>
>> The servers are connected via a multi-rail network, because some clients
>> are on InfiniBand and the other clients are on Ethernet.
>>
>> But we have the following problem: when an OST fails over to the
>> second node, the clients are unable to contact the OST that is started on
>> the other node.
>> The OST recovery status is "waiting for clients".
>> When we fail it back, it starts working again and the recovery status is
>> "complete".
>>
>> We tried to abort the recovery but that does not work.
>>
>> We used these documents to build the cluster:
>> https://wiki.lustre.org/Creating_the_Lustre_Management_Service_(MGS)
>> https://wiki.lustre.org/Creating_the_Lustre_Metadata_Service_(MDS)
>> https://wiki.lustre.org/Creating_Lustre_Object_Storage_Services_(OSS)
>>
>> https://wiki.lustre.org/Creating_Pacemaker_Resources_for_Lustre_Storage_Services
>>
>> I'm not sure what the next steps should be to find the problem, or where
>> to look.
>>
>> Best regards
>> Koos Meijering
>> ........................................................................
>> HPC Team
>> Rijksuniversiteit Groningen
>> ........................................................................
>> _______________________________________________
>> lustre-discuss mailing list
>> [email protected]
>> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
>>
>
Nov 19 11:53:30 dh4-oss01 stonith-ng[4910]:  notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss01 stonith-ng[4910]:  notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss01 Lustre(muse01)[220826]: INFO: Starting to unmount /dev/mapper/muse01
Nov 19 11:53:30 dh4-oss01 kernel: Lustre: Failing over muse-OST0001
Nov 19 11:53:31 dh4-oss01 kernel: Lustre: server umount muse-OST0001 complete
Nov 19 11:53:31 dh4-oss01 Lustre(muse01)[220826]: INFO: /dev/mapper/muse01 unmounted successfully
Nov 19 11:53:31 dh4-oss01 crmd[4914]:  notice: Result of stop operation for muse01 on dh4-oss01: 0 (ok)
Nov 19 11:53:54 dh4-oss01 kernel: LustreError: 137-5: muse-OST0001_UUID: not available for connect from 172.23.53.214@o2ib4 (no target). If you are running an HA pair check that the target is mounted on the other server.
Nov 19 11:53:54 dh4-oss01 kernel: LustreError: Skipped 83 previous similar messages

Nov 19 11:53:30 dh4-oss02 crmd[4901]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
Nov 19 11:53:30 dh4-oss02 stonith-ng[4897]:  notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss02 stonith-ng[4897]:  notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss02 pengine[4900]:  notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss02 pengine[4900]:  notice: Calculated transition 273, saving inputs in /var/lib/pacemaker/pengine/pe-input-152.bz2
Nov 19 11:53:30 dh4-oss02 pengine[4900]:  notice: On loss of CCM Quorum: Ignore
Nov 19 11:53:30 dh4-oss02 pengine[4900]:  notice:  * Move       muse01              ( dh4-oss01 -> dh4-oss02 )
Nov 19 11:53:30 dh4-oss02 pengine[4900]:  notice: Calculated transition 274, saving inputs in /var/lib/pacemaker/pengine/pe-input-153.bz2
Nov 19 11:53:30 dh4-oss02 crmd[4901]:  notice: Initiating stop operation muse01_stop_0 on dh4-oss01
Nov 19 11:53:31 dh4-oss02 crmd[4901]:  notice: Initiating start operation muse01_start_0 locally on dh4-oss02
Nov 19 11:53:31 dh4-oss02 Lustre(muse01)[142345]: INFO: Starting to mount /dev/mapper/muse01
Nov 19 11:53:31 dh4-oss02 kernel: LDISKFS-fs (dm-2): file extents enabled, maximum tree depth=5
Nov 19 11:53:32 dh4-oss02 kernel: LDISKFS-fs (dm-2): mounted filesystem with ordered data mode. Opts: errors=remount-ro,no_mbcache,nodelalloc
Nov 19 11:53:32 dh4-oss02 kernel: Lustre: muse-OST0001: Imperative Recovery enabled, recovery window shrunk from 300-900 down to 150-900
Nov 19 11:53:32 dh4-oss02 kernel: Lustre: muse-OST0001: in recovery but waiting for the first client to connect
Nov 19 11:53:32 dh4-oss02 kernel: Lustre: Skipped 1 previous similar message
Nov 19 11:53:32 dh4-oss02 Lustre(muse01)[142345]: INFO: /dev/mapper/muse01 mounted successfully
Nov 19 11:53:32 dh4-oss02 crmd[4901]:  notice: Result of start operation for muse01 on dh4-oss02: 0 (ok)
Nov 19 11:53:32 dh4-oss02 crmd[4901]:  notice: Initiating monitor operation muse01_monitor_20000 locally on dh4-oss02
Nov 19 11:53:32 dh4-oss02 crmd[4901]:  notice: Transition 274 (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-153.bz2): Complete
Nov 19 11:53:32 dh4-oss02 crmd[4901]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE

Nov 19 11:53:32 dh4-mds02 kernel: LustreError: 11-0: muse-OST0001-osc-MDT0000: operation ost_statfs to node 172.23.53.175@o2ib4 failed: rc = -107
Nov 19 11:53:32 dh4-mds02 kernel: Lustre: muse-OST0001-osc-MDT0000: Connection to muse-OST0001 (at 172.23.53.175@o2ib4) was lost; in progress operations using this service will wait for recovery to complete
Nov 19 11:53:32 dh4-mds02 kernel: Lustre: muse-MDT0000: Connection restored to a7fa3ae3-f879-926d-aeef-f3c62d62dd7e (at 172.23.53.176@o2ib4)
Nov 19 11:53:32 dh4-mds02 kernel: Lustre: Skipped 1 previous similar message
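The MDS log above shows the connection to the old NID (172.23.53.175@o2ib4) being dropped. A quick check from an affected client, to see whether it can reach the new server and which NIDs it has imported for the OST (NIDs taken from the logs above; adjust for your setup):

```shell
# On a client: can we reach the OSS node that now serves the OST?
lctl ping 172.23.53.176@o2ib4

# Inspect the client's import for this OST: the failover_nids list and
# the connection state show whether the client knows about the second node.
lctl get_param osc.muse-OST0001-osc-*.import
```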
