I would like to strongly echo what Bob has stated, especially regarding missing 
or wrong documentation, and I have in-lined some comments below.

I liken GPFS to a critical care patient at the hospital. You have to check on 
its state regularly and know the running heart rate (e.g. waiters) and the 
response of every component, from disks to networks to server load. When a 
problem occurs, running tests (such as nsdperf) to help isolate the problem 
quickly is crucial. Capturing GPFS trace data is also very important if the 
problem isn't obvious. But then you have to wait for IBM support to parse the 
information and give you their analysis of the situation. It would be great to 
get an advanced troubleshooting document that describes how to read the output 
of `mmfsadm dump` commands and the GPFS trace report that is generated.
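
For anyone building that kind of regular check-up, this is a minimal sketch of 
the commands I lean on - paths and options are as I recall them on our levels, 
so please verify against your own release:

    # Snapshot of the current waiters - long or growing waiters are the "heart rate"
    /usr/lpp/mmfs/bin/mmdiag --waiters

    # Same data at a lower level (the output is largely undocumented today)
    /usr/lpp/mmfs/bin/mmfsadm dump waiters

    # Capture trace data around a reproducible problem for IBM support
    /usr/lpp/mmfs/bin/mmtracectl --start
    # ... reproduce the problem ...
    /usr/lpp/mmfs/bin/mmtracectl --stop

    # nsdperf ships as source under /usr/lpp/mmfs/samples/net (on our levels)
    # and has to be compiled before it can be used for network testing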

Cheers,
-Bryan

From: [email protected] 
[mailto:[email protected]] On Behalf Of Oesterlin, Robert
Sent: Thursday, October 01, 2015 7:39 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Problem Determination

Hi Patrick

I was going to mail you directly – but this may help spark some discussion in 
this area. GPFS (pardon the use of the "old school" term – you need something 
easier to type than Spectrum Scale) problem determination is one of those areas 
that is (sometimes) more of an art than a science. IBM publishes a PD guide, 
and it's a good start, but it doesn't cover all the bases.

- In the GPFS log (/var/mmfs/gen/mmfslog) there are a lot of messages 
generated. I continue to come across ones that are not documented – or 
documented poorly. EVERYTHING that ends up in ANY log needs to be documented.
- The PD guide gives some basic things to look at for many of the error 
messages, but doesn't go into alternative explanations for many errors. Example: 
when a node gets expelled, the PD guide tells you it's a communication issue, 
when in fact it may be related to other things like Linux network tuning. 
Covering all the possible causes is hard, but you can improve this.
- GPFS waiter information – understanding and analyzing this is key to getting 
to the bottom of many problems. The waiter information is not well documented. 
You should include at least a basic guide on how to use waiter information in 
determining cluster problems. Related: undocumented config options. You can 
come across some by doing `mmdiag --config` (a rough sketch follows below). 
Using some of these can help you – or get you in trouble in the long run. If I 
can see the option, document it.
                [Bryan: Also please, please provide a way to check whether or 
not the configuration parameters need to be changed. I assume that there is an 
`mmfsadm dump` command that can tell you whether a config parameter needs to 
be changed; if not, make one! Just stating something like “This could be 
increased to XX value for very large clusters” is not very helpful.]
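                [Bryan: As a rough sketch of what I mean above – the options 
are from memory on our 4.1 level, so please verify them against your release:

    # Show the full configuration, including options the manuals never mention
    mmdiag --config

    # Lower-level view of the same settings; today a human has to judge whether
    # a value needs changing - an automated check would be welcome
    mmfsadm dump config | grep -i pagepool

    # Waiters again - the output we most need a reading guide for
    mmdiag --waiters
]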

- Make sure that all information I might come across online is accurate, 
especially on those sites managed by IBM. The Developerworks wiki has great 
information, but there is a lot of information out there that’s out of date or 
inaccurate. This leads to confusion.
                [Bryan: I know that Scott Fadden is a busy man, so I would 
recommend helping distribute the workload of maintaining the wiki 
documentation. This data should be reviewed on a more regular basis, at least 
once for each major release I would hope, and updated or deleted if found to 
be out of date.]

- The automatic deadlock detection implemented in 4.1 can be useful, but it 
can also be problematic when you get into trouble on a large cluster. Firing 
off traces and taking dumps in an automated manner can cause more problems at 
that scale. I ended up turning it off.
                [Bryan: From what I’ve heard, IBM is actively working to make 
the deadlock amelioration logic better. I agree that firing off traces can 
cause more problems, and we have turned off the automated collection as well 
(roughly as sketched below). We are going to work on enabling the collection 
of some data during these events to help ensure we get enough data for IBM to 
analyze the problem.]
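                [Bryan: For reference, this is roughly how we disabled it – the 
parameter names and their exact semantics are from memory on 4.1, so please 
check the deadlock section of the PD guide and your `mmlsconfig` output before 
applying anything:

    # Stop GPFS from gathering debug data automatically on suspected deadlocks
    mmchconfig deadlockDataCollectionDailyLimit=0

    # Or disable the automated deadlock detection entirely
    mmchconfig deadlockDetectionThreshold=0
]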

- GPFS doesn’t have anything set up to alert you when conditions occur that may 
require your attention. There are some alerting capabilities that you can 
customize, but something out of the box might be useful. I know there is work 
going on in this area.
                [Bryan: The GPFS callback facilities are very useful for 
setting up alerts (an example follows below), but they are not well documented 
or advertised by the GPFS manuals. I hope to see more callback capabilities 
added to help monitor all aspects of the GPFS cluster and file systems.]
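                [Bryan: As an illustration of the callback facility – the alert 
script path here is hypothetical, and the exact event names and %parms should 
be checked against the mmaddcallback section of the Administration guide for 
your release:

    # Run a (hypothetical) alert script whenever a node leaves the cluster,
    # e.g. after an expel
    mmaddcallback nodeLeaveAlert --command /usr/local/bin/gpfs_alert.sh \
        --event nodeLeave --parms "%eventNode"

    # List the callbacks currently configured on the cluster
    mmlscallback
]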


mmces – I did some early testing on this but haven’t had a chance to upgrade my 
protocol nodes to the new level. Upgrading 1000s of nodes across many clusters 
is – challenging :-) The newer commands are a great start. I like the ability 
to list out events related to a particular protocol.
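
For anyone who hasn’t tried them yet, this is the sort of thing I mean – syntax 
as I remember it from my early 4.1.1 testing, so verify the exact component 
names on your level:

    # Overall health of the protocol services on the CES nodes
    mmces state show -a

    # Events logged against a particular protocol, e.g. NFS
    mmces events list NFS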

I could go on… Feel free to contact me directly for a more detailed discussion: 
robert.oesterlin @ nuance.com

Bob Oesterlin
Sr Storage Engineer, Nuance Communications

From: <[email protected]> on behalf of Patrick Byrne
Reply-To: gpfsug main discussion list
Date: Thursday, October 1, 2015 at 5:09 AM
To: "[email protected]"
Subject: [gpfsug-discuss] Problem Determination

Hi all,

As I'm sure some of you are aware, problem determination is an area where we 
are looking to make significant improvements over the coming releases of 
Spectrum Scale. To help us target the areas we work on and make the result as 
useful as possible, I am trying to get as much feedback as I can about the 
different problems users have and how people go about solving them.

I am interested in hearing everything from day to day annoyances to problems 
that have caused major frustration in trying to track down the root cause. 
Where possible it would be great to hear how the problems were dealt with as 
well, so that others can benefit from your experience. Feel free to reply to 
the mailing list - maybe others have seen similar problems and could provide 
tips for the future - or to me directly if you'd prefer 
([email protected]<mailto:[email protected]>).

On a related note, in 4.1.1 there was a component added that monitors the state 
of the various protocols that are now supported (NFS, SMB, Object). The output 
from this is available with the 'mmces state' and 'mmces events' CLIs, and I 
would like to get feedback from anyone who has had the chance to make use of 
this. Is it useful? How could it be improved? We are looking at the possibility 
of extending this component to cover more than just protocols, so any feedback 
would be greatly appreciated.

Thanks in advance,

Patrick Byrne
IBM Spectrum Scale - Development Engineer
IBM Systems - Manchester Lab
IBM UK Limited


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
