Hi Kevin,

I think there is a misconception about how FSStruct errors are detected and 
handled.

All nodes in a Storage Scale cluster run a health monitoring daemon (the 
backend for the mmhealth command), which monitors the individual components and 
listens to callbacks to detect issues such as FSStruct errors.
As you correctly mentioned, the FSStruct callbacks fire only on the filesystem 
manager node, so the corresponding mmhealth event is raised on that node.
You can see those events by running mmhealth node show on that node.

Irrespective of whether this is an EMS node or an I/O node, mmhealth forwards 
every event to the cluster manager to provide a consolidated, cluster-wide 
state view (mmhealth cluster show).
In addition, all events are forwarded to the GUI, which shows them as alerts.
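
For a quick check on the command line, the two views named above can be 
queried directly (the exact output columns vary by release):

  mmhealth node show        (per-node view; on the filesystem manager node this includes the FSStruct event)
  mmhealth cluster show     (consolidated, cluster-wide view aggregated via the cluster manager)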

Since many customers have their own monitoring systems, we provide multiple 
ways to get notified about new events:

  *   The Scale GUI allows you to configure email notifications or SNMP traps:
https://www.ibm.com/docs/en/storage-scale/5.1.9?topic=gui-event-notifications


  *   mmhealth offers a modern webhook interface (see the first sketch below this list):
https://www.ibm.com/docs/en/storage-scale/5.1.9?topic=command-configuring-webhook-by-using-mmhealth

  *   mmhealth can call user-defined scripts to trigger any custom notification tool (see the second sketch below this list):
https://www.ibm.com/docs/en/storage-scale/5.1.9?topic=mhn-running-user-defined-script-when-event-is-raised


  *   Third-party monitoring tools can use the REST API or the mmhealth CLI to poll the system status (see the third sketch below this list):
https://www.ibm.com/docs/en/storage-scale/5.1.9?topic=endpoints-nodesnamehealthstates-get
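
To illustrate the webhook option, here is a minimal sketch of a receiver that 
the webhook could be pointed at. It simply logs whatever JSON is POSTed to it; 
the actual payload layout, and whether your release requires an HTTPS endpoint, 
are described at the webhook documentation link above, and the port is an 
arbitrary example.

#!/usr/bin/env python3
# Minimal webhook receiver sketch: log whatever event JSON is POSTed to it.
# The real mmhealth payload layout (and any TLS requirement) is documented at the
# webhook link above; nothing about the payload structure is assumed here.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class EventHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        try:
            event = json.loads(body)
        except ValueError:
            event = {"raw": body.decode("utf-8", "replace")}
        print("received event:", json.dumps(event, indent=2))
        self.send_response(200)   # acknowledge delivery to the sender
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), EventHandler).serve_forever()   # port 8080 is arbitrary

From here you could forward the event into whatever alerting or ticketing 
system you already operate.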
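
To illustrate the user-defined script option, the sketch below shows what such 
a script could look like. Where the script has to be placed and exactly which 
arguments mmhealth passes to it are described at the documentation link above, 
so this sketch deliberately just scans whatever arguments it receives for an 
FSStruct-related string; the relay host and mail addresses are placeholders.

#!/usr/bin/env python3
# Sketch of a user-defined event script: if the event details passed as arguments
# mention an FSStruct problem, send a short alert mail through a remote SMTP relay.
# Script location and argument layout are defined by mmhealth (see the link above);
# the relay host and mail addresses below are placeholders.
import smtplib
import socket
import sys
from email.message import EmailMessage

SMTP_RELAY = "smtp.example.com"            # placeholder relay reachable from the node
MAIL_FROM = "[email protected]"     # placeholder sender address
MAIL_TO = "[email protected]"       # placeholder recipient

def main() -> None:
    details = " ".join(sys.argv[1:])
    if "fsstruct" not in details.lower():  # loose match; adjust to the real event name
        return
    msg = EmailMessage()
    msg["Subject"] = "FSStruct event reported on " + socket.gethostname()
    msg["From"] = MAIL_FROM
    msg["To"] = MAIL_TO
    msg.set_content(details)
    with smtplib.SMTP(SMTP_RELAY, timeout=30) as smtp:
        smtp.send_message(msg)

if __name__ == "__main__":
    main()

Because the script talks directly to an SMTP relay, the node only needs a 
network route to that relay; no local mailx or postfix installation is required.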
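
And to illustrate polling, here is a minimal sketch that queries the health 
states of a single node over the REST API (the endpoint behind the last link 
above). The hostname, port, credentials, URL prefix and the field names in the 
response are assumptions on my side; please check the REST API reference for 
the exact schema and use a CA-verified TLS connection in production.

#!/usr/bin/env python3
# Sketch: poll the Storage Scale management REST API for one node's health states.
# Hostname, credentials, URL prefix and response field names are assumptions; see the
# nodes/{name}/health/states endpoint documentation linked above for the real layout.
import requests   # third-party "requests" library

GUI_HOST = "ems1.example.com"        # placeholder: node running the GUI/REST service
NODE_NAME = "ionode1.example.com"    # placeholder: node whose health states to fetch
URL = f"https://{GUI_HOST}:443/scalemgmt/v2/nodes/{NODE_NAME}/health/states"

resp = requests.get(URL,
                    auth=("monitor", "secret"),   # placeholder read-only credentials
                    verify="/path/to/ca.pem")     # placeholder CA bundle for TLS verification
resp.raise_for_status()

# Field names below are a guess at the typical response shape; adjust to the documented schema.
for state in resp.json().get("states", []):
    if state.get("state") != "HEALTHY":
        print(state.get("component"), state.get("entityName"), state.get("state"))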


Depending on which option you choose and where your external monitoring system 
is running, you need to ensure that there is a network route to that system 
(e.g. GUI email & SNMP notifications require the EMS node to reach the mail or 
SNMP server, while a webhook or custom script requires whichever node raises 
the event to reach the target server).
ESS I/O nodes are not necessarily restricted to an internal network; we have 
many customers who attach their ESS to their campus network for central 
management and monitoring.

If you have further questions or want to hear more about monitoring & 
notifications, I would be happy to schedule a WebEx session with you.

best regards

Mathias Dietz

Storage Scale RAS Architect

IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Wolfgang Wendt
Management: David Faller
Registered office: Böblingen / Registration court: Amtsgericht Stuttgart, HRB 
243294
________________________________
From: gpfsug-discuss <[email protected]> on behalf of 
Buterbaugh, Kevin Lynn <[email protected]>
Sent: Wednesday, January 24, 2024 6:08 PM
To: [email protected] <[email protected]>
Subject: [EXTERNAL] [gpfsug-discuss] Wouldn't you like to know if you had 
filesystem corruption?


Hi All,



Wouldn’t you like to know if your IBM ESS had filesystem corruption?  If you 
answered “no” my guess is that you’ve never experienced undetected filesystem 
corruption!  😉



Did you know that if you’ve got an IBM ESS set up in its default 
configuration, which also matches the recommended configuration in every last 
piece of IBM documentation that I’ve ever come across, you WILL NOT be notified 
of filesystem corruption?!?



Do you think IBM should fix this ASAP?  If so, please up vote 
https://ideas.ibm.com/ideas/ESS-I-61.



If you, like me, consider this a bug in the existing product and not a “feature 
enhancement” to maybe be included in some future release if we’re lucky, then 
please keep reading.



Here are the gory details, to the best of my understanding…



Your IBM ESS can and will detect filesystem corruption (FS_STRUCT errors).  But 
it currently will NOT, and cannot, let you know that it’s happened.  The reason 
is that FS_STRUCT errors are detected only on the filesystem manager node, 
which makes sense.  But if you’re running in the default and recommended 
configuration, your filesystem manager node is one of the I/O nodes, not the 
EMS node.  The I/O nodes have no way to communicate anything out to you unless 
IBM decides to configure them to do so – like they ALREADY DO with other things 
like hardware events – by routing the error through the EMS node, which can 
send it on to you.



You could fix this problem yourself by writing a custom callback script to send 
you an e-mail (or a text) whenever an FS_STRUCT error is detected by the 
filesystem manager node … EXCEPT that you’d need mailx / postfix or something 
like that and IBM doesn’t provide you with a way to install them on the I/O 
nodes.  As an aside, if you’re NOT on an ESS (i.e. running GPFS on some sort of 
commodity hardware) you can and should do this!



There is a workaround for this issue, which is to run your filesystem 
manager(s) on the EMS node.  However, 1) this goes against IBM’s 
recommendations (and defaults), and 2) is not possible for larger ESS systems 
as the EMS node doesn’t have enough RAM to handle the filesystem manager 
function.



Personally, I think it’s absolutely crazy that an I/O node can tell you that 
you’ve got a pdisk failure but can’t tell you that you’ve got filesystem 
corruption!  If you agree, then please up vote the RFE above.



<rant>

Even if you don’t agree, let me ask you to consider up voting the RFE anyway.  
Why?  To send a message to IBM that you consider it unacceptable for them to 
allow a customer (me, obviously) to open up a support ticket for this very 
issue (again, I consider this a very serious bug, not a feature enhancement) in 
July of 2023, work with the customer for 6 months, and then blow the customer 
off by telling them, and I quote:



“As per the dev team, this feature has been in this way since really old 
versions and has not changed which means that is not going to change soon.  You 
can request an RFE with your idea for the development team to take it into 
account. Below I share the link where you can share your idea (RFE):”



“Not going to change soon.”  Thanks for nothing, IBM … well, I do appreciate 
your honesty.  I’ve got one other RFE out there - submitted in August of 2022 - 
and its status is still “Future Consideration.”  I guess I’ll just keep my 
fingers crossed that I never have filesystem corruption on an ESS.  But if I 
do, let me highly recommend to you that you not assign me one of your support 
personnel who does not understand that 1 plus 4 does not equal 6 … or that 
October comes before November on the calendar (both of which I have actually 
had happen to me in the last 6 months; no, sadly, I am not joking or 
exaggerating in the least).



To all the IBMers reading this I want you to know that I personally consider 
the ESS and GPFS to be the best storage solution out there from a technical 
perspective … I truly do.  But that is rapidly becoming irrelevant when you are 
also doing things like the above, especially when you are overly proud (I think 
you know what I mean) of your support even if it was good, which it used to be 
but sadly no longer is.



IBMers, I’m sure you don’t like this bit of public shaming.  Guess what?  I 
don’t like doing it.  But I have complained directly to IBM about these things 
for quite some time now (ask my sales rep if you don’t believe me) and it’s 
done no good whatsoever.  Not only did I count to 100 before composing this 
e-mail, I slept on it.  I don’t know what else to do when things aren’t 
changing.  But I promise you this, if you’ll stop doing stuff like this I will 
absolutely be more than glad to never have to send another e-mail like this one 
again.  Deal?

</rant>



Thank you, all…



Kevin B.


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
