Hello Daniel,
I solved my problem by disabling the check (I'm running GPFS v4.2.3-5): I put

ib_rdma_enable_monitoring=False

in the [network] section of the file /var/mmfs/mmsysmon/mmsysmonitor.conf, and then 
restarted mmsysmonitor.
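
For reference, the change amounts to something like the excerpt below (I only added 
the one line to the existing [network] section; mmsysmoncontrol restart is the 
restart command Yaron mentions further down, and mmhealth node show is just to 
verify the event is gone afterwards):

[network]
ib_rdma_enable_monitoring=False

mmsysmoncontrol restart
mmhealth node show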

There was a thread in this group about this problem.

   A

________________________________
From: [email protected] 
[[email protected]] on behalf of Yaron Daniel 
[[email protected]]
Sent: Sunday, July 01, 2018 7:17 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] How to get rid of very old mmhealth events

Hi

There was an issue in Scale 5.x that caused the GUI error ib_rdma_nic_unrecognized(mlx5_0/2).

Check if you have the patch:

[root@gssio1 ~]#  diff /usr/lpp/mmfs/lib/mmsysmon/NetworkService.py /tmp/NetworkService.py
229c229,230
<         recognizedNICs = set(re.findall(r"verbsConnectPorts\[\d+\] +: (\w+/\d+)/\d+\n", mmfsadm))
---
>         #recognizedNICs = set(re.findall(r"verbsConnectPorts\[\d+\] +: (\w+/\d+)/\d+\n", mmfsadm))
>          recognizedNICs = set(re.findall(r"verbsConnectPorts\[\d+\] +: (\w+/\d+)/\d+/\d+\n", mmfsadm))


And restart the monitor: mmsysmoncontrol restart
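
To see what the patch changes: the extra /\d+ in the pattern accepts verbsConnectPorts 
lines that carry one more numeric field, which the old pattern silently skipped, so the 
port was never "recognized" and mmhealth kept raising ib_rdma_nic_unrecognized. A small 
stand-alone sketch (the mmfsadm dump line is made up just for illustration; the real 
output on your level of Scale may differ):

import re

# Hypothetical "mmfsadm dump" line, for illustration only.
mmfsadm = "verbsConnectPorts[0]   : mlx5_0/1/0/0\n"

old_pat = r"verbsConnectPorts\[\d+\] +: (\w+/\d+)/\d+\n"
new_pat = r"verbsConnectPorts\[\d+\] +: (\w+/\d+)/\d+/\d+\n"

print(set(re.findall(old_pat, mmfsadm)))  # set() -> port not recognized
print(set(re.findall(new_pat, mmfsadm)))  # {'mlx5_0/1'} -> port recognized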

Regards

________________________________



Yaron Daniel     94 Em Ha'Moshavot Rd

Storage Architect – IL Lab Services (Storage)    Petach Tiqva, 49527
IBM Global Markets, Systems HW Sales     Israel

Phone:  +972-3-916-5672
Fax:    +972-3-916-5672
Mobile: +972-52-8395593
e-mail: [email protected]
IBM Israel<http://www.ibm.com/il/he/>






From:        "Andrew Beattie" <[email protected]>
To:        [email protected]
Date:        06/28/2018 11:16 AM
Subject:        Re: [gpfsug-discuss] How to get rid of very old mmhealth events
Sent by:        [email protected]
________________________________



Do you know if there is actually a cable plugged into port 2?

The system will work fine as long as there is network connectivity, but you may 
have an issue with redundancy or loss of bandwidth if you do not have every 
port cabled and configured correctly.

Regards
Andrew Beattie
Software Defined Storage  - IT Specialist
Phone: 614-2133-7927
E-mail: [email protected]


----- Original message -----
From: "Dorigo Alvise (PSI)" <[email protected]>
Sent by: [email protected]
To: "[email protected]" <[email protected]>
Cc:
Subject: [gpfsug-discuss] How to get rid of very old mmhealth events
Date: Thu, Jun 28, 2018 6:08 PM

Dear experts,
I have a GL2 IBM system running Spectrum Scale v4.2.3-6 (RHEL 7.3).
The system is working properly, but I get a DEGRADED status report for the 
NETWORK component when running the command mmhealth:

[root@sf-gssio1 ~]# mmhealth node show

Node name:      sf-gssio1.psi.ch
Node status:    DEGRADED
Status Change:  23 min. ago

Component       Status        Status Change     Reasons
-------------------------------------------------------------------------------------------------------------------------------------------
GPFS            HEALTHY       22 min. ago       -
NETWORK         DEGRADED      145 days ago      ib_rdma_link_down(mlx5_0/2), ib_rdma_nic_down(mlx5_0/2), ib_rdma_nic_unrecognized(mlx5_0/2)
[...]

This event is clearly stale, because the network, verbs, and IB are working 
correctly:

[root@sf-gssio1 ~]# mmfsadm test verbs status
VERBS RDMA status: started

[root@sf-gssio1 ~]# mmlsconfig verbsPorts|grep gssio1
verbsPorts mlx5_0/1 [sf-ems1,sf-gssio1,sf-gssio2]

[root@sf-gssio1 ~]# mmdiag --config|grep verbsPorts
! verbsPorts mlx5_0/1

[root@sf-gssio1 ~]# ibstat  mlx5_0
CA 'mlx5_0'
   CA type: MT4113
   Number of ports: 2
   Firmware version: 10.16.1020
   Hardware version: 0
   Node GUID: 0xec0d9a03002b5db0
   System image GUID: 0xec0d9a03002b5db0
   Port 1:
       State: Active
       Physical state: LinkUp
       Rate: 56
       Base lid: 42
       LMC: 0
       SM lid: 1
       Capability mask: 0x26516848
       Port GUID: 0xec0d9a03002b5db0
       Link layer: InfiniBand
   Port 2:
       State: Down
       Physical state: Disabled
       Rate: 10
       Base lid: 65535
       LMC: 0
       SM lid: 0
       Capability mask: 0x26516848
       Port GUID: 0xec0d9a03002b5db8
       Link layer: InfiniBand

That event has been there for 145 days, and it did not go away after a daemon restart 
(mmshutdown/mmstartup).
My question is: how can I get rid of this event and restore mmhealth's 
output to HEALTHY? This is important because I have Nagios sensors that 
periodically parse the "mmhealth -Y ..." output, and at the moment I have to 
disable their email notifications (which is not good if a real bad event 
happens).
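
For reference, a minimal version of such a sensor could look something like the sketch 
below. It is only a sketch: it greps the -Y output for the status keywords rather than 
relying on fixed column positions, since the field layout of the colon-delimited output 
can differ between releases.

import subprocess
import sys

# Minimal Nagios-style check around "mmhealth node show -Y".
out = subprocess.run(["mmhealth", "node", "show", "-Y"],
                     capture_output=True, text=True).stdout

if "FAILED" in out:
    print("CRITICAL - mmhealth reports a FAILED component")
    sys.exit(2)
if "DEGRADED" in out:
    print("WARNING - mmhealth reports a DEGRADED component")
    sys.exit(1)
print("OK - no FAILED/DEGRADED components reported by mmhealth")
sys.exit(0)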

Thanks,

  Alvise
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
