Re: [gpfsug-discuss] gpfsug-discuss Digest, Vol 113, Issue 19

2021-06-22 Thread Saula, Oluwasijibomi
Simon,

Thanks for the quick response and related information! We are at least at 
v5.0.5 so we shouldn't see much exposure to this issue then.


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>







From: gpfsug-discuss-boun...@spectrumscale.org 
 on behalf of 
gpfsug-discuss-requ...@spectrumscale.org 

Sent: Tuesday, June 22, 2021 10:56 AM
To: gpfsug-discuss@spectrumscale.org 
Subject: gpfsug-discuss Digest, Vol 113, Issue 19

Send gpfsug-discuss mailing list submissions to
gpfsug-discuss@spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
or, via email, send a message with subject or body 'help' to
gpfsug-discuss-requ...@spectrumscale.org

You can reach the person managing the list at
gpfsug-discuss-ow...@spectrumscale.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of gpfsug-discuss digest..."


Today's Topics:

   1. GPFS bad at memory-mapped files? (Saula, Oluwasijibomi)
   2. Re: GPFS bad at memory-mapped files? (Simon Thompson)


--

Message: 1
Date: Tue, 22 Jun 2021 15:17:16 +0000
From: "Saula, Oluwasijibomi" 
To: "gpfsug-discuss@spectrumscale.org"

Subject: [gpfsug-discuss] GPFS bad at memory-mapped files?
Message-ID:



Content-Type: text/plain; charset="windows-1252"

Hello,

While reviewing the AMS software suite for installation, I noticed this remark 
(https://www.scm.com/doc/Installation/Additional_Information_and_Known_Issues.html#gpfs-file-system):

-

GPFS file system<https://www.scm.com/doc/Installation/Additional_Information_and_Known_Issues.html#gpfs-file-system>

Starting with AMS2019, the KF sub-system (used for handling binary files such 
as ADF's TAPE* files) has been rewritten to use memory-mapped files. The mmap() 
system call implementation is file-system dependent and, unfortunately, it is 
not equally efficient in different file systems. The memory-mapped files 
implementation in GPFS is extremely inefficient. Therefore the users should 
avoid using a GPFS for scratch files



Is this claim true? Are there caveats to this statement, if true?


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>






--

Message: 2
Date: Tue, 22 Jun 2021 15:55:54 +0000
From: Simon Thompson 
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] GPFS bad at memory-mapped files?
Message-ID: <14ccbde7-ab03-456b-806b-6ad1a8270...@bham.ac.uk>
Content-Type: text/plain; charset="utf-8"

There certainly *were* issues.

See for example: 
http://files.gpfsug.org/presentations/2018/London/6_GPFSUG_EBI.pdf
And the follow on IBM talk on the same day: 
http://files.gpfsug.org/presentations/2018/London/6_MMAP_V2.pdf

And also from this year: 
https://www.spectrumscaleug.org/event/ssugdigital-spectrum-scale-expert-talks-update-on-performance-enhancements-in-spectrum-scale/

So it certainly used to be true. Whether it still is will depend on the GPFS 
code level you are running.
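
One way to check on a given code level, rather than guess, is to benchmark mmap
I/O directly. A minimal sketch using fio (assuming fio is installed; the
directory, sizes, and job names below are placeholders, not anything from this
thread):

    # mmap engine exercises the same access pattern the AMS KF subsystem uses
    fio --name=mmap-probe --ioengine=mmap --rw=randrw --bs=4k --size=1g \
        --runtime=60 --time_based --directory=/gpfs1/scratch

    # same workload through ordinary read/write calls, for comparison
    fio --name=psync-probe --ioengine=psync --rw=randrw --bs=4k --size=1g \
        --runtime=60 --time_based --directory=/gpfs1/scratch

If the two runs diverge sharply on GPFS but not on a local filesystem, the SCM
warning likely still applies to your code level.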

Simon

From:  on behalf of "Saula, 
Oluwasijibomi" 
Reply to: "gpfsug-discuss@spectrumscale.org" 
Date: Tuesday, 22 June 2021 at 16:17
To: "gpfsug-discuss@spectrumscale.org" 
Subject: [gpfsug-discuss] GPFS bad at memory-mapped files?

Hello,

While reviewing the AMS software suite for installation, I noticed this remark 
(https://www.scm.com/doc/Installation/Additional_Information_and_Known_Issues.html#gpfs-file-system):

-

GPFS file system

Starting with AMS2019, the KF sub-system (used for handling binary files such 
as ADF's TAPE* files) has been rewritten to use memory-mapped files. The mmap() 
system call implementation is file-system dependent and, unfortunately, it is 
not equally efficient in different file systems. The memory-mapped files 
implementation in GPFS is extremely inefficient. Therefore the users should 
avoid using a GPFS for scratch files


Is this claim true? Are there caveats to this statement, if true?

Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>





-- next part --

[gpfsug-discuss] GPFS bad at memory-mapped files?

2021-06-22 Thread Saula, Oluwasijibomi
Hello,

While reviewing the AMS software suite for installation, I noticed this remark 
(https://www.scm.com/doc/Installation/Additional_Information_and_Known_Issues.html#gpfs-file-system):

-

GPFS file 
system

Starting with AMS2019, the KF sub-system (used for handling binary files such 
as ADF’s TAPE* files) has been rewritten to use memory-mapped files. The mmap() 
system call implementation is file-system dependent and, unfortunately, it is 
not equally efficient in different file systems. The memory-mapped files 
implementation in GPFS is extremely inefficient. Therefore the users should 
avoid using a GPFS for scratch files



Is this claim true? Are there caveats to this statement, if true?


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu





___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Long IO waiters and IBM Storwize V5030

2021-05-28 Thread Saula, Oluwasijibomi
Hi Folks,

So, we are experiencing some very long IO waiters in our GPFS cluster:


#  mmdiag --waiters


=== mmdiag: waiters ===

Waiting 17.3823 sec since 10:41:01, monitored, thread 21761 NSDThread: for I/O 
completion

Waiting 16.6140 sec since 10:41:02, monitored, thread 21730 NSDThread: for I/O 
completion

Waiting 15.3004 sec since 10:41:03, monitored, thread 21763 NSDThread: for I/O 
completion

Waiting 15.2013 sec since 10:41:03, monitored, thread 22175

However, GPFS support is pointing to our IBM Storwize V5030 disk system as the 
source of latency. Unfortunately, we don't have paid support for the system so 
we are polling for anyone who might be able to assist.

Does anyone by chance have any experience with IBM Storwize V5030 or possess a 
problem determination guide for the V5030?

We've briefly reviewed the V5030 management portal, but we still haven't 
identified a cause for the increased latencies (i.e. read ~129ms, write ~198ms).

Granted, we have some heavy client workloads, yet we seem to experience this 
drastic drop in performance every couple of months, probably exacerbated by 
heavy IO demands.

Any assistance would be much appreciated.
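
For anyone else triaging similar symptoms, one rough way to correlate the GPFS
waiters with the back-end disk system is to sample both sides on the same
clock. A sketch only: it assumes passwordless ssh to the V5030 CLI and that
this firmware level reports the vdisk_r_ms/vdisk_w_ms statistics via
lssystemstats:

    # sample long waiters and V5030 volume latency once a minute
    while sleep 60; do
        date
        mmdiag --waiters | head -5                    # longest NSD waiters
        ssh superuser@v5030-mgmt lssystemstats | grep -E 'vdisk_(r|w)_ms'
    done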



Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu





___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] gpfsug-discuss Digest, Vol 110, Issue 34

2021-03-30 Thread Saula, Oluwasijibomi
Hey Olaf,

We'll investigate as suggested. I'm hopeful the journald logs would provide 
some additional insight.

As for OFED versions, we use the same Mellanox version across the cluster and 
haven't seen any issues with working nodes that mount the filesystem.

We also have a PMR open with IBM but we'll send a follow-up if we discover 
something more for group discussion.



Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>







From: gpfsug-discuss-boun...@spectrumscale.org 
 on behalf of 
gpfsug-discuss-requ...@spectrumscale.org 

Sent: Tuesday, March 30, 2021 1:07 AM
To: gpfsug-discuss@spectrumscale.org 
Subject: gpfsug-discuss Digest, Vol 110, Issue 34

Send gpfsug-discuss mailing list submissions to
gpfsug-discuss@spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
or, via email, send a message with subject or body 'help' to
gpfsug-discuss-requ...@spectrumscale.org

You can reach the person managing the list at
gpfsug-discuss-ow...@spectrumscale.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of gpfsug-discuss digest..."


Today's Topics:

   1. Filesystem mount attempt hangs GPFS client node
  (Saula, Oluwasijibomi)
   2. Re: Filesystem mount attempt hangs GPFS client node (Olaf Weiser)


--

Message: 1
Date: Mon, 29 Mar 2021 18:38:00 +0000
From: "Saula, Oluwasijibomi" 
To: "gpfsug-discuss@spectrumscale.org"

Subject: [gpfsug-discuss] Filesystem mount attempt hangs GPFS client
node
Message-ID:



Content-Type: text/plain; charset="utf-8"

Hello Folks,

So we are experiencing a mind-boggling issue where just a couple of nodes in 
our cluster hang so badly at GPFS startup that they must be power reset.

These AMD client nodes are diskless and have at least 256G of memory. We have 
other AMD nodes that are working just fine in a separate GPFS cluster, albeit 
on RHEL7.

Just before GPFS (or related processes) seize up the node, the following lines 
of /var/mmfs/gen/mmfslog are noted:


2021-03-29_12:47:37.343-0500: [N] mmfsd ready

2021-03-29_12:47:37.426-0500: mmcommon mmfsup invoked. Parameters: 10.12.50.47 
10.12.50.242 all

2021-03-29_12:47:37.587-0500: mounting /dev/mmfs1

2021-03-29_12:47:37.590-0500: [I] Command: mount mmfs1

2021-03-29_12:47:37.859-0500: [N] Connecting to 10.12.50.243 
tier1-sn-02.pixstor 

2021-03-29_12:47:37.864-0500: [I] VERBS RDMA connecting to 10.12.50.242 
(tier1-sn-01.pixstor) on mlx5_0 port 1 fabnum 0 sl 0 index 0

2021-03-29_12:47:37.864-0500: [I] VERBS RDMA connecting to 10.12.50.242 
(tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 1

2021-03-29_12:47:37.866-0500: [I] VERBS RDMA connected to 10.12.50.242 
(tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 0

2021-03-29_12:47:37.867-0500: [I] VERBS RDMA connected to 10.12.50.242 
(tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 1

2021-03-29_12:47:37.868-0500: [I] Connected to 10.12.50.243 tier1-sn-02 

There have been hunches that this might be a network issue; however, other 
nodes connected to the same IB switch are mounting the filesystem without 
incident.

I'm inclined to believe a GPFS- or OS-specific setting is causing these 
crashes, especially since disabling the automount on the client node avoids 
the hang. However, once we issue mmmount, we see the node seize up shortly 
after...

Please let me know if you have any thoughts on where to look for root causes, 
as a few of us are stuck here.
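
Since the hang reproduces on demand with mmmount, one low-risk data point is a
GPFS trace around the mount attempt. A minimal sketch, assuming default trace
settings and with "badnode" standing in for the affected client:

    mmtracectl --start -N badnode    # begin tracing on the affected node
    mmmount mmfs1 -N badnode         # reproduce the hang
    # after the node wedges, from any surviving node:
    mmtracectl --stop -N badnode     # trace output lands under /tmp/mmfs by default
    gpfs.snap                        # bundle logs for the PMR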



Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>






--

Message: 2
Date: Tue, 30 Mar 2021 06:06:54 +0000
From: "Olaf Weiser" 
To: gpfsug-discuss@spectrumscale.org
Cc: gpfsug-discuss@spectrumscale.org
Subject: Re: [gpfsug-discuss] Filesystem mount attempt hangs GPFS
client node
Message-ID:



Content-Type: text/plain; charset="us-ascii"

An HTML attachment was scrubbed...
URL: 
<http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20210330/ae3c3cdd/attachment.html>

--

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

[gpfsug-discuss] Filesystem mount attempt hangs GPFS client node

2021-03-29 Thread Saula, Oluwasijibomi
Hello Folks,

So we are experiencing a mind-boggling issue where just a couple of nodes in 
our cluster hang so badly at GPFS startup that they must be power reset.

These AMD client nodes are diskless and have at least 256G of memory. We have 
other AMD nodes that are working just fine in a separate GPFS cluster, albeit 
on RHEL7.

Just before GPFS (or related processes) seize up the node, the following lines 
of /var/mmfs/gen/mmfslog are noted:


2021-03-29_12:47:37.343-0500: [N] mmfsd ready

2021-03-29_12:47:37.426-0500: mmcommon mmfsup invoked. Parameters: 10.12.50.47 
10.12.50.242 all

2021-03-29_12:47:37.587-0500: mounting /dev/mmfs1

2021-03-29_12:47:37.590-0500: [I] Command: mount mmfs1

2021-03-29_12:47:37.859-0500: [N] Connecting to 10.12.50.243 
tier1-sn-02.pixstor 

2021-03-29_12:47:37.864-0500: [I] VERBS RDMA connecting to 10.12.50.242 
(tier1-sn-01.pixstor) on mlx5_0 port 1 fabnum 0 sl 0 index 0

2021-03-29_12:47:37.864-0500: [I] VERBS RDMA connecting to 10.12.50.242 
(tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 1

2021-03-29_12:47:37.866-0500: [I] VERBS RDMA connected to 10.12.50.242 
(tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 0

2021-03-29_12:47:37.867-0500: [I] VERBS RDMA connected to 10.12.50.242 
(tier1-sn-01) on mlx5_0 port 1 fabnum 0 sl 0 index 1

2021-03-29_12:47:37.868-0500: [I] Connected to 10.12.50.243 tier1-sn-02 

There have been hunches that this might be a network issue; however, other 
nodes connected to the same IB switch are mounting the filesystem without 
incident.

I'm inclined to believe a GPFS- or OS-specific setting is causing these 
crashes, especially since disabling the automount on the client node avoids 
the hang. However, once we issue mmmount, we see the node seize up shortly 
after...

Please let me know if you have any thoughts on where to look for root causes, 
as a few of us are stuck here.



Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu





___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] gpfsug-discuss Digest, Vol 106, Issue 3

2020-11-04 Thread Saula, Oluwasijibomi
Could someone share the password for the event today? Thanks!


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu







From: gpfsug-discuss-boun...@spectrumscale.org 
 on behalf of 
gpfsug-discuss-requ...@spectrumscale.org 

Sent: Wednesday, November 4, 2020 6:00 AM
To: gpfsug-discuss@spectrumscale.org 
Subject: gpfsug-discuss Digest, Vol 106, Issue 3

Send gpfsug-discuss mailing list submissions to
gpfsug-discuss@spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
or, via email, send a message with subject or body 'help' to
gpfsug-discuss-requ...@spectrumscale.org

You can reach the person managing the list at
gpfsug-discuss-ow...@spectrumscale.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of gpfsug-discuss digest..."


Today's Topics:

   1. SSUG::Digital Scalable multi-node training for AI workloads
  on NVIDIA DGX, Red Hat OpenShift and IBM Spectrum Scale
  (Simon Thompson)
   2. Re: Alternative to Scale S3 API. (Andi Christiansen)


--

Message: 1
Date: Tue, 3 Nov 2020 17:00:54 +0000
From: Simon Thompson 
To: "gpfsug-discuss@spectrumscale.org"

Subject: [gpfsug-discuss] SSUG::Digital Scalable multi-node training
for AI workloads on NVIDIA DGX, Red Hat OpenShift and IBM Spectrum
Scale
Message-ID: 
Content-Type: text/plain; charset="utf-8"

Apologies, looks like the calendar invite for this week's SSUG::Digital didn't 
get sent!



Nvidia and IBM did a complex proof-of-concept to demonstrate the scaling of AI 
workloads using Nvidia DGX, Red Hat OpenShift and IBM Spectrum Scale, with 
ResNet-50 and segmentation of images from the Audi A2D2 dataset as the 
examples. The project team published an IBM Redpaper with all the technical 
details and will present the key learnings and results.


>>> Join Here: <https://ibm.webex.com/ibm/onstage/g.php?MTID=e896290a1eef7e81ab4b411669138a17e>


This episode will start 15 minutes later than usual.


   *   San Francisco, USA at 08:15 PST

   *   New York, USA at 11:15 EST

   *   London, United Kingdom at 16:15 GMT

   *   Frankfurt, Germany at 17:15 CET

   *   Pune, India at 21:45 IST




--

Message: 2
Date: Wed, 4 Nov 2020 08:14:41 +0100 (CET)
From: Andi Christiansen 
To: gpfsug main discussion list ,
Christian Vieser 
Subject: Re: [gpfsug-discuss] Alternative to Scale S3 API.
Message-ID: <1512108314.679947.1604474081...@privateemail.com>
Content-Type: text/plain; charset="utf-8"

Hi Christian,

Thanks for the information! My question also triggered IBM to tell me the same, 
so I think we will stay on S3 with Scale and hope the new release brings the 
same improvements.

Yes, MinIO is really lacking some good documentation... but definitely a cool 
software package that I will keep an eye on in the future...

Best Regards
Andi Christiansen


> On 11/02/2020 2:44 PM Christian Vieser  wrote:
>
>
>
> Hi Andi,
>
> we suffer from the same issue. IBM support told me that Spectrum Scale 
> 5.1 will come with a new release of the underlying Openstack components, so 
> we still hope that some/most of the limitations will vanish then. But I already 
> know, that the new S3 policies won't be available, only the "legacy" S3 ACLs.
>
> We also tried MinIO but deemed that it's not "production ready". It's 
> fine for quickly setting up a S3 service for development, but they release 
> too often and with breaking changes, and documentation is lacking all aspects 
> regarding maintenance.
>
> Regards,
>
> Christian
>
> Am 27.10.20 um 12:46 schrieb Andi Christiansen:
>
> > > Hi all,
> >
> > We have used the S3 API within Spectrum Scale for quite a while now, and 
> > that has shown that it does not support very many applications because of 
> > limitations of the API..
> >
> > > ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
-- next part --
An HTML attachment was scrubbed...
URL: 

Re: [gpfsug-discuss] gpfsug-discuss Digest, Vol 104, Issue 19

2020-09-19 Thread Saula, Oluwasijibomi
Ryan,

I appreciate your support - I finally got some help on a WebEx just now.

I'll share any useful information I glean from the session.

Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator / Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu>


From: gpfsug-discuss-boun...@spectrumscale.org 
 on behalf of 
gpfsug-discuss-requ...@spectrumscale.org 

Sent: Saturday, September 19, 2020 6:45:47 PM
To: gpfsug-discuss@spectrumscale.org 
Subject: gpfsug-discuss Digest, Vol 104, Issue 19

Send gpfsug-discuss mailing list submissions to
gpfsug-discuss@spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
or, via email, send a message with subject or body 'help' to
gpfsug-discuss-requ...@spectrumscale.org

You can reach the person managing the list at
gpfsug-discuss-ow...@spectrumscale.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of gpfsug-discuss digest..."


Today's Topics:

   1. Re: gpfsug-discuss Digest, Vol 104, Issue 18
  (Saula, Oluwasijibomi)
   2. Re: gpfsug-discuss Digest, Vol 104, Issue 18 (Ryan Novosielski)


--

Message: 1
Date: Sat, 19 Sep 2020 20:52:19 +0000
From: "Saula, Oluwasijibomi" 
To: "gpfsug-discuss@spectrumscale.org"

Subject: Re: [gpfsug-discuss] gpfsug-discuss Digest, Vol 104, Issue 18
Message-ID:



Content-Type: text/plain; charset="us-ascii"

Ryan,

We've been at severity 1 since about 4am with only a single response all day.

Got me a bit concerned now...


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>







From: gpfsug-discuss-boun...@spectrumscale.org 
 on behalf of 
gpfsug-discuss-requ...@spectrumscale.org 

Sent: Saturday, September 19, 2020 3:23 PM
To: gpfsug-discuss@spectrumscale.org 
Subject: gpfsug-discuss Digest, Vol 104, Issue 18

Send gpfsug-discuss mailing list submissions to
gpfsug-discuss@spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
or, via email, send a message with subject or body 'help' to
gpfsug-discuss-requ...@spectrumscale.org

You can reach the person managing the list at
gpfsug-discuss-ow...@spectrumscale.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of gpfsug-discuss digest..."


Today's Topics:

   1. CCR errors (Saula, Oluwasijibomi)
   2. Re: CCR errors (Ryan Novosielski)


--

Message: 1
Date: Sat, 19 Sep 2020 20:11:31 +0000
From: "Saula, Oluwasijibomi" 
To: "gpfsug-discuss@spectrumscale.org"

Subject: [gpfsug-discuss] CCR errors
Message-ID:



Content-Type: text/plain; charset="iso-8859-1"

Hello,

Anyone available to assist with CCR errors:



[root@nsd02 ~]# mmchnode -N nsd04-ib --quorum --manager

mmchnode: Unable to obtain the GPFS configuration file lock. Retrying ...


Per IBM support's direction, I already followed the Manual Repair 
Procedure<https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.3/com.ibm.spectrum.scale.v5r03.doc/bl1pdg_noccrreco_multinode.htm>,
 but now I'm back to square one with the same issue.

Also, I have a ticket over to IBM for troubleshooting this issue during our 
downtime this weekend, but support's response is really slow.

If possible, I'd prefer a webex session to facilitate closure, but I can send 
emails back and forth if necessary.
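
For others hitting the same lock message, a few quick state checks are worth
running before repeating the manual repair. A sketch; note that mmccr is an
internal, undocumented command whose output varies by release:

    mmgetstate -aL             # quorum view from every node
    mmccr check                # internal: sanity-check CCR state on this node
    ls -l /var/mmfs/ccr/       # committed CCR files and Paxos state directory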


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>






--

Message: 2
Date: Sat, 19 Sep 2020 20:23:01 +0000
From: Ryan Novosielski 
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] CCR errors
Message-ID: 
Content-Type: text/plain; charset="utf-8"

I find them to be pretty fast and very experienced at severity 1. Don't 
hesitate to use it (unless that's already where you're at).

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>

Re: [gpfsug-discuss] gpfsug-discuss Digest, Vol 104, Issue 18

2020-09-19 Thread Saula, Oluwasijibomi
Ryan,

We've been at severity 1 since about 4am with only a single response all day.

Got me a bit concerned now...


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>







From: gpfsug-discuss-boun...@spectrumscale.org 
 on behalf of 
gpfsug-discuss-requ...@spectrumscale.org 

Sent: Saturday, September 19, 2020 3:23 PM
To: gpfsug-discuss@spectrumscale.org 
Subject: gpfsug-discuss Digest, Vol 104, Issue 18

Send gpfsug-discuss mailing list submissions to
gpfsug-discuss@spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
or, via email, send a message with subject or body 'help' to
gpfsug-discuss-requ...@spectrumscale.org

You can reach the person managing the list at
gpfsug-discuss-ow...@spectrumscale.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of gpfsug-discuss digest..."


Today's Topics:

   1. CCR errors (Saula, Oluwasijibomi)
   2. Re: CCR errors (Ryan Novosielski)


--

Message: 1
Date: Sat, 19 Sep 2020 20:11:31 +0000
From: "Saula, Oluwasijibomi" 
To: "gpfsug-discuss@spectrumscale.org"

Subject: [gpfsug-discuss] CCR errors
Message-ID:



Content-Type: text/plain; charset="iso-8859-1"

Hello,

Anyone available to assist with CCR errors:



[root@nsd02 ~]# mmchnode -N nsd04-ib --quorum --manager

mmchnode: Unable to obtain the GPFS configuration file lock. Retrying ...


Per IBM support's direction, I already followed the Manual Repair 
Procedure<https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.3/com.ibm.spectrum.scale.v5r03.doc/bl1pdg_noccrreco_multinode.htm>,
 but now I'm back to square one with the same issue.

Also, I have a ticket over to IBM for troubleshooting this issue during our 
downtime this weekend, but support's response is really slow.

If possible, I'd prefer a webex session to facilitate closure, but I can send 
emails back and forth if necessary.


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>






--

Message: 2
Date: Sat, 19 Sep 2020 20:23:01 +0000
From: Ryan Novosielski 
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] CCR errors
Message-ID: 
Content-Type: text/plain; charset="utf-8"

I find them to be pretty fast and very experienced at severity 1. Don't 
hesitate to use it (unless that's already where you're at).

--

|| \\UTGERS,   |---*O*---
||_// the State | Ryan Novosielski - 
novos...@rutgers.edu<mailto:novos...@rutgers.edu>
|| \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
||  \\of NJ | Office of Advanced Research Computing - MSB C630, Newark
`'

On Sep 19, 2020, at 16:11, Saula, Oluwasijibomi  
wrote:

Hello,

Anyone available to assist with CCR errors:



[root@nsd02 ~]# mmchnode -N nsd04-ib --quorum --manager

mmchnode: Unable to obtain the GPFS configuration file lock. Retrying ...


Per IBM support's direction, I already followed the Manual Repair 
Procedure<https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.3/com.ibm.spectrum.scale.v5r03.doc/bl1pdg_noccrreco_multinode.htm>,
 but now I'm back to square one with the same issue.

Also, I have a ticket over to IBM for troubleshooting this issue during our 
downtime this weekend, but support's response is really slow.

If possible, I'd prefer a webex session to facilitate closure, but I can send 
emails back and forth if necessary.


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>






___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

--

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

[gpfsug-discuss] CCR errors

2020-09-19 Thread Saula, Oluwasijibomi
Hello,

Anyone available to assist with CCR errors:



[root@nsd02 ~]# mmchnode -N nsd04-ib --quorum --manager

mmchnode: Unable to obtain the GPFS configuration file lock. Retrying ...


Per IBM support's direction, I already followed the Manual Repair 
Procedure,
 but now I'm back to square one with the same issue.

Also, I have a ticket over to IBM for troubleshooting this issue during our 
downtime this weekend, but support's response is really slow.

If possible, I'd prefer a webex session to facilitate closure, but I can send 
emails back and forth if necessary.


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu





___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Short-term Deactivation of NSD server

2020-09-04 Thread Saula, Oluwasijibomi
Hello GPFS Experts,

Say, is there any way to disable a particular NSD server outside of shutting 
down GPFS on the server, or shutting down the entire cluster and removing the 
NSD server from the list of NSD servers?

I'm finding that TSM activity on one of our NSD servers is stifling IO traffic 
through the server and resulting in intermittent latency for clients. If we 
could restrict cluster IO from going through this NSD server, we might be able 
to minimize or eliminate the latencies experienced by the clients while TSM 
activity is ongoing.

Thoughts?
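
One option, hedged because the details vary by release, is to demote the busy
server within each disk's NSD server list instead of removing it; mmchnsd can
rewrite the list, though on older code levels it requires the file system to
be unmounted first. An illustrative sketch (disk and server names are borrowed
from a related post in this archive, examples only):

    # move the TSM host to the end of the preference list for one NSD
    mmchnsd "tier2_004:nsd01-ib,nsd02-ib,nsd03-ib,nsd04-ib,tsm01-ib"
    mmlsnsd -d tier2_004       # confirm the new server order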



Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu





___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Client Latency and High NSD Server Load Average

2020-06-05 Thread Saula, Oluwasijibomi
Valdis/Kums/Fred/Kevin/Stephen,

Thanks so much for your insights, thoughts, and pointers! - Certainly increased 
my knowledge and understanding of potential culprits to watch for...

So we finally discovered the root cause of this problem: an unattended TSM 
restore exercise profusely writing to a single file, over and over again, into 
the GBs! I'm opening up a ticket with TSM support to learn how to mitigate 
this in the future.

But with the RAID 6 write costs Valdis explained, it now makes sense why the 
write IO was so badly affected...

Excerpt from output file:


--- User Action is Required ---

File '/gpfs1/X/Y/Z/fileABC' is write protected


Select an appropriate action

  1. Force an overwrite for this object

  2. Force an overwrite on all objects that are write protected

  3. Skip this object

  4. Skip all objects that are write protected

  A. Abort this operation

Action [1,2,3,4,A] : The only valid responses are characters from this set: [1, 
2, 3, 4, A]

Action [1,2,3,4,A] : The only valid responses are characters from this set: [1, 
2, 3, 4, A]
Action [1,2,3,4,A] : The only valid responses are characters from this set: [1, 
2, 3, 4, A]
Action [1,2,3,4,A] : The only valid responses are characters from this set: [1, 
2, 3, 4, A]
...
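
For the archives, the loop above can be headed off by answering the replace
question up front on the dsmc command line. A hedged sketch; check the exact
option spelling against your BA client documentation:

    # replace=all overwrites write-protected files without prompting;
    # replace=no would skip them instead of looping on the question
    dsmc restore "/gpfs1/X/Y/Z/*" -subdir=yes -replace=all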


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>







From: gpfsug-discuss-boun...@spectrumscale.org 
 on behalf of 
gpfsug-discuss-requ...@spectrumscale.org 

Sent: Friday, June 5, 2020 6:00 AM
To: gpfsug-discuss@spectrumscale.org 
Subject: gpfsug-discuss Digest, Vol 101, Issue 12

Send gpfsug-discuss mailing list submissions to
gpfsug-discuss@spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
or, via email, send a message with subject or body 'help' to
gpfsug-discuss-requ...@spectrumscale.org

You can reach the person managing the list at
gpfsug-discuss-ow...@spectrumscale.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of gpfsug-discuss digest..."


Today's Topics:

   1. Re: Client Latency and High NSD Server Load Average
   (Valdis Klētnieks)


--

Message: 1
Date: Thu, 04 Jun 2020 21:17:08 -0400
From: "Valdis Kl=?utf-8?Q?=c4=93?=tnieks" 
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] Client Latency and High NSD Server Load
Average
Message-ID: <309214.1591319828@turing-police>
Content-Type: text/plain; charset="us-ascii"

On Thu, 04 Jun 2020 15:33:18 -, "Saula, Oluwasijibomi" said:

> However, I still can't understand why write IO operations are 5x more latent
> than read operations to the same class of disks.

Two things that may be biting you:

First, on a RAID 5 or 6 LUN, most of the time you only need to do 2 physical
reads (data and parity block). To do a write, you have to read the old parity
block, compute the new value, and write the data block and new parity block.
This is often called the "RAID write penalty".

Second, if a read size is smaller than the physical block size, the storage 
array can read
a block, and return only the fragment needed.  But on a write, it has to read
the whole block, splice in the new data, and write back the block - a RMW (read
modify write) cycle.
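
Putting rough numbers on that for RAID 6 (a sketch that ignores caching and
full-stripe writes):

    small random read:    1 disk I/O
    small random write:   read old data + read P + read Q    = 3 reads
                          write new data + write P + write Q = 3 writes
                          total                              = 6 disk I/Os

So each host write can cost roughly six times the back-end work of a read,
which makes a 5x write-latency gap under load unsurprising.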

--

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


End of gpfsug-discuss Digest, Vol 101, Issue 12
***
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Client Latency and High NSD Server Load Average

2020-06-04 Thread Saula, Oluwasijibomi
Stephen,

Looked into client requests, and it doesn't seem to lean heavily on any one NSD 
server. Of course, this is an eyeball assessment after reviewing IO request 
percentages to the different NSD servers from just a few nodes.

By the way, I later discovered our TSM/NSD server couldn't handle restoring a 
read-only file and ended up writing GBs into my output file asking for my 
response... that seems to have contributed to some unnecessarily high write IO.

However, I still can't understand why write IO operations are 5x more latent 
than read operations to the same class of disks.

Maybe it's time for a GPFS support ticket...


Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>







From: gpfsug-discuss-boun...@spectrumscale.org 
 on behalf of 
gpfsug-discuss-requ...@spectrumscale.org 

Sent: Wednesday, June 3, 2020 9:19 PM
To: gpfsug-discuss@spectrumscale.org 
Subject: gpfsug-discuss Digest, Vol 101, Issue 9

Send gpfsug-discuss mailing list submissions to
gpfsug-discuss@spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
or, via email, send a message with subject or body 'help' to
gpfsug-discuss-requ...@spectrumscale.org

You can reach the person managing the list at
gpfsug-discuss-ow...@spectrumscale.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of gpfsug-discuss digest..."


Today's Topics:

   1. Re: Client Latency and High NSD Server Load Average
  (Stephen Ulmer)


--

Message: 1
Date: Wed, 3 Jun 2020 22:19:49 -0400
From: Stephen Ulmer 
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] Client Latency and High NSD Server Load
Average
Message-ID: 
Content-Type: text/plain; charset="utf-8"

Note that if nsd02-ib is offline, nsd03-ib is now servicing all of the NSDs 
for *both* servers, and that if nsd03-ib gets busy enough to appear offline, 
then nsd04-ib would be next in line to get the load of all 3. The two servers 
with the problems are in line after the one that is off.

This is based on the candy striping of the NSD server order (which I think most 
of us do).

NSD fail-over is "straight-forward" so to speak - the last I checked, it is 
really fail-over in the listed order, not load balancing among the servers 
(which is why you stripe them). I do *not* know if individual clients make the 
decision that the I/O for a disk should go through the "next" NSD server, or if 
it is done cluster-wide (in the case of intermittently super-slow I/O). 
Hopefully someone with source code access will answer that, because now I'm 
curious...

Check what path the clients are using to the NSDs, i.e. which server. See if 
you are surprised. :)
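
A quick way to check that from a client, sketched below; the -m option reports
whether each disk is reached locally or via an NSD server, and which one:

    # run on a client node; shows the I/O path per disk for *this* node
    mmlsdisk gpfs1 -m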

 --
Stephen


> On Jun 3, 2020, at 6:03 PM, Saula, Oluwasijibomi 
>  wrote:
>
> ?
> Frederick,
>
> Yes on both counts! -  mmdf is showing pretty uniform (ie 5 NSDs out of 30 
> report 65% free; All others are uniform at 58% free)...
>
> NSD servers per disks are called in round-robin fashion as well, for example:
>
>  gpfs1 tier2_001    nsd02-ib,nsd03-ib,nsd04-ib,tsm01-ib,nsd01-ib
>  gpfs1 tier2_002    nsd03-ib,nsd04-ib,tsm01-ib,nsd01-ib,nsd02-ib
>  gpfs1 tier2_003    nsd04-ib,tsm01-ib,nsd01-ib,nsd02-ib,nsd03-ib
>  gpfs1 tier2_004    tsm01-ib,nsd01-ib,nsd02-ib,nsd03-ib,nsd04-ib
>
> Any other potential culprits to investigate?
>
> I do notice nsd03/nsd04 have long waiters, but nsd01 doesn't (nsd02-ib is 
> offline for now):
> [nsd03-ib ~]# mmdiag --waiters
> === mmdiag: waiters ===
> Waiting 6.5113 sec since 17:17:33, monitored, thread 4175 NSDThread: for I/O 
> completion
> Waiting 6.3810 sec since 17:17:33, monitored, thread 4127 NSDThread: for I/O 
> completion
> Waiting 6.1959 sec since 17:17:34, monitored, thread 4144 NSDThread: for I/O 
> completion
>
> nsd04-ib:
> Waiting 13.1386 sec since 17:19:09, monitored, thread 9971 NSDThread: for I/O 
> completion
> Waiting 10.3562 sec since 17:19:12, monitored, thread 9958 NSDThread: for I/O 
> completion
> Waiting 10.0338 sec since 17:19:12, monitored, thread 9951 NSDThread: for I/O 
> completion
>
> tsm01-ib:
> Waiting 8.1211 sec since 17:20:24, monitored, thread 3644 NSDThread: for I/O 
> completion
> Waiting 7.6690 sec since 17:20:24, monitored, thread 3641 NSDThread: for I/O 
> completion
> Waiting 7.4969 sec since 17:20:24, monitored, thread 3658 NSDThread: for I/O 
> completion
> Waiting 7.3573 sec since 17:20:24, monitored, thread 3642 NSDThread: for I/O 
> c

Re: [gpfsug-discuss] Client Latency and High NSD Server Load Average

2020-06-03 Thread Saula, Oluwasijibomi
Frederick,

Yes on both counts! -  mmdf is showing pretty uniform (ie 5 NSDs out of 30 
report 65% free; All others are uniform at 58% free)...

NSD servers per disks are called in round-robin fashion as well, for example:


 gpfs1 tier2_001    nsd02-ib,nsd03-ib,nsd04-ib,tsm01-ib,nsd01-ib

 gpfs1 tier2_002    nsd03-ib,nsd04-ib,tsm01-ib,nsd01-ib,nsd02-ib

 gpfs1 tier2_003    nsd04-ib,tsm01-ib,nsd01-ib,nsd02-ib,nsd03-ib

 gpfs1 tier2_004    tsm01-ib,nsd01-ib,nsd02-ib,nsd03-ib,nsd04-ib

Any other potential culprits to investigate?

I do notice nsd03/nsd04 have long waiters, but nsd01 doesn't (nsd02-ib is 
offline for now):

[nsd03-ib ~]# mmdiag --waiters

=== mmdiag: waiters ===

Waiting 6.5113 sec since 17:17:33, monitored, thread 4175 NSDThread: for I/O 
completion

Waiting 6.3810 sec since 17:17:33, monitored, thread 4127 NSDThread: for I/O 
completion

Waiting 6.1959 sec since 17:17:34, monitored, thread 4144 NSDThread: for I/O 
completion


nsd04-ib:

Waiting 13.1386 sec since 17:19:09, monitored, thread 9971 NSDThread: for I/O 
completion

Waiting 10.3562 sec since 17:19:12, monitored, thread 9958 NSDThread: for I/O 
completion

Waiting 10.0338 sec since 17:19:12, monitored, thread 9951 NSDThread: for I/O 
completion


tsm01-ib:

Waiting 8.1211 sec since 17:20:24, monitored, thread 3644 NSDThread: for I/O 
completion

Waiting 7.6690 sec since 17:20:24, monitored, thread 3641 NSDThread: for I/O 
completion

Waiting 7.4969 sec since 17:20:24, monitored, thread 3658 NSDThread: for I/O 
completion

Waiting 7.3573 sec since 17:20:24, monitored, thread 3642 NSDThread: for I/O 
completion


nsd01-ib:

Waiting 0.2548 sec since 17:21:47, monitored, thread 30513 NSDThread: for I/O 
completion

Waiting 0.1502 sec since 17:21:47, monitored, thread 30529 NSDThread: for I/O 
completion



Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu<http://www.ndsu.edu/>







From: gpfsug-discuss-boun...@spectrumscale.org 
 on behalf of 
gpfsug-discuss-requ...@spectrumscale.org 

Sent: Wednesday, June 3, 2020 4:56 PM
To: gpfsug-discuss@spectrumscale.org 
Subject: gpfsug-discuss Digest, Vol 101, Issue 6

Send gpfsug-discuss mailing list submissions to
gpfsug-discuss@spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
or, via email, send a message with subject or body 'help' to
gpfsug-discuss-requ...@spectrumscale.org

You can reach the person managing the list at
gpfsug-discuss-ow...@spectrumscale.org

When replying, please edit your Subject line so it is more specific
than "Re: Contents of gpfsug-discuss digest..."


Today's Topics:

   1. Introducing SSUG::Digital
  (Simon Thompson (Spectrum Scale User Group Chair))
   2. Client Latency and High NSD Server Load Average
  (Saula, Oluwasijibomi)
   3. Re: Client Latency and High NSD Server Load Average
  (Frederick Stock)


--

Message: 1
Date: Wed, 03 Jun 2020 20:11:17 +0100
From: "Simon Thompson (Spectrum Scale User Group Chair)"

To: "gpfsug-discuss@spectrumscale.org"

Subject: [gpfsug-discuss] Introducing SSUG::Digital
Message-ID: 
Content-Type: text/plain; charset="utf-8"

Hi All.,



I happy that we can finally announce SSUG:Digital, which will be a series of 
online session based on the types of topic we present at our in-person events.



I know it's taken us a while to get this up and running, but we've been 
working on trying to get the format right. So save the date for the first 
SSUG:Digital event which will take place on Thursday 18th June 2020 at 4pm BST. 
That's:
San Francisco, USA at 08:00 PDT
New York, USA at 11:00 EDT
London, United Kingdom at 16:00 BST
Frankfurt, Germany at 17:00 CEST
Pune, India at 20:30 IST
We estimate about 90 minutes for the first session, and please forgive any 
teething troubles as we get this going!



(I know the times don't work for everyone in the global community!)



Each of the sessions we run over the next few months will be a different 
Spectrum Scale Experts or Deep Dive session.

More details at:

https://www.spectrumscaleug.org/introducing-ssugdigital/



(We'll announce the speakers and topic of the first session in the next few 
days.)



Thanks to Ulf, Kristy, Bill, Bob and Ted for their help and guidance in getting 
this going.



We're keen to include some user talks and site updates later in the series, so 
please let me know if you might be interested in presenting in this format.



Simon Thompson

SSUG Group Chair


[gpfsug-discuss] Client Latency and High NSD Server Load Average

2020-06-03 Thread Saula, Oluwasijibomi

Hello,

Has anyone faced a situation where a majority of NSD servers have a high load 
average and a minority don't?

Also, is 10x higher NSD server latency for write operations than for read 
operations expected under any circumstance?

We are seeing client latency between 6 and 9 seconds and are wondering if some 
GPFS configuration or NSD server condition may be triggering this poor 
performance.



Thanks,


Oluwasijibomi (Siji) Saula

HPC Systems Administrator  /  Information Technology



Research 2 Building 220B / Fargo ND 58108-6050

p: 701.231.7749 / www.ndsu.edu





___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Reason for shutdown: Reset old shared segment

2019-05-23 Thread Saula, Oluwasijibomi
Hey Folks,

I got a strange message on one of my HPC cluster nodes that I'm hoping to 
understand better: "Reason for shutdown: Reset old shared segment"


2019-05-23_11:47:07.328-0500: [I] This node has a valid standard license

2019-05-23_11:47:07.327-0500: [I] Initializing the fast condition variables at 
0x57115300 ...

2019-05-23_11:47:07.328-0500: [I] mmfsd initializing. {Version: 5.0.0.0   
Built: Dec 10 2017 16:59:21} ...

2019-05-23_11:47:07.328-0500: [I] Cleaning old shared memory ...

2019-05-23_11:47:07.328-0500: [N] mmfsd is shutting down.

2019-05-23_11:47:07.328-0500: [N] Reason for shutdown: Reset old shared segment

Shortly after the GPFS is back up without any intervention:


2019-05-23_11:47:52.685-0500: [N] Remounted gpfs1

2019-05-23_11:47:52.691-0500: [N] mmfsd ready

I'm supposing this has to do with memory usage?...


Thanks,

Siji Saula
HPC System Administrator
Center for Computationally Assisted Science & Technology
NORTH DAKOTA STATE UNIVERSITY


Research 2 Building – Room 220B
Dept 4100, PO Box 6050  / Fargo, ND 58108-6050
p:701.231.7749
www.ccast.ndsu.edu | 
www.ndsu.edu

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] gpfsug-discuss Digest, Vol 82, Issue 31

2018-11-21 Thread Saula, Oluwasijibomi
Subject: Re: [gpfsug-discuss] mmfsd recording High CPU usage
Message-ID: <1913697205.666954.1542805314...@mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

At a guess, with no data: if the application is opening more files than can 
fit in the maxFilesToCache (MFTC) objects, GPFS will expand the MFTC to 
support the open files, but it will also scan to try and free any unused 
objects. If you can identify the user job that is causing this, you could 
monitor the system more closely.

    On Wednesday, November 21, 2018, 2:10:45 AM EST, Saula, Oluwasijibomi 
 wrote:

  
Hello Scalers,

First, let me say Happy Thanksgiving to those of us in the US and to those 
beyond, well, it's a still happy day seeing we're still above ground!

Now, what I have to discuss isn't anything extreme so don't skip the turkey 
for this, but lately, on a few of our compute GPFS client nodes, we've been 
noticing high CPU usage by the mmfsd process and are wondering why. Here's a 
sample:

[~]# top -b -n 1 | grep mmfs

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND

231898 root       0 -20 14.508g 4.272g  70168 S  93.8  6.8  69503:41 mmfsd

  4161 root       0 -20  121876   9412   1492 S   0.0  0.0   0:00.22 runmmfs

Obviously, this behavior was likely triggered by a not-so-convenient user job 
that in most cases is long finished by the time we investigate. Nevertheless, 
does anyone have an idea why this might be happening? Any thoughts on 
preventive steps even?

This is GPFS v4.2.3 on Redhat 7.4, btw...

Thanks,
Siji Saula
HPC System Administrator
Center for Computationally Assisted Science & Technology
NORTH DAKOTA STATE UNIVERSITY

Research 2 Building – Room 220B
Dept 4100, PO Box 6050 / Fargo, ND 58108-6050
p: 701.231.7749
www.ccast.ndsu.edu | www.ndsu.edu


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


--

Message: 2
Date: Wed, 21 Nov 2018 07:32:55 -0800
From: Sven Oehme 
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] mmfsd recording High CPU usage
Message-ID:

Content-Type: text/plain; charset="utf-8"

Hi,

the best way to debug something like that is to start with top. start top
then press 1 and check if any of the cores has almost 0% idle while others
have plenty of CPU left. if that is the case you have one very hot thread.
to further isolate it you can press 1 again to collapse the cores, now
press shift-H which will break down each thread of a process and show them
as individual lines.
now you either see one or many mmfsd's causing cpu consumption. if it's many,
your workload is just doing a lot of work; what is more concerning is if
you have just 1 thread running at 90%+. if that's the case, write down
the PID of the thread that runs so hot and run mmfsadm dump
threads,kthreads  >dum. you will see many entries in the file like :

MMFSADMDumpCmdThread: desc 0x7FC84C002980 handle 0x4C0F02FA parm
0x7FC978C0 highStackP 0x7FC783F7E530
  pthread 0x83F80700 kernel thread id 49878 (slot -1) pool 21
ThPoolCommands
  per-thread gbls:
0:0x0 1:0x0 2:0x0 3:0x3 4:0x 5:0x0 6:0x0
7:0x7FC98C0067B0
8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x40E 14:0x7FC98C004C10
15:0x0
16:0x4 17:0x0 18:0x0

find the pid behind 'thread id' and post that section, that would be the
first indication on what that thread does ...
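
a non-interactive equivalent of those keystrokes, as a sketch:

    # batch-mode snapshot of mmfsd threads (-H = show threads), busiest first
    top -b -H -n 1 -p "$(pgrep -x mmfsd)" | head -20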

sven






On Tue, Nov 20, 2018 at 11:10 PM Saula, Oluwasijibomi <
oluwasijibomi.sa...@ndsu.edu> wrote:

> Hello Scalers,
>
>
> First, let me say Happy Thanksgiving to those of us in the US and to those
> beyond, well, it's a still happy day seeing we're still above ground!
>
>
> Now, what I have to discuss isn't anything extreme so don't skip the
> turkey for this, but lately, on a few of our compute GPFS client nodes,
> we've been noticing high CPU usage by the mmfsd process and are wondering
> why. Here's a sample:
>
>
> [~]# top -b -n 1 | grep mmfs
>
>    PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>
>
> 231898 root   0 -20 14.508g 4.272g  70168 S  93.8  6.8  69503:41
> *mmfs*d
>
>   4161 root   0 -20  121876   9412   1492 S   0.0  0.0   0:00.22 run
> *mmfs*
>
> Obviously, this behavior was likely triggered by a not-so-convenient user
> job that in most cases is long finished by the time we
> investigate. Nevertheless, does anyone have an idea why this might be
> happening? Any thoughts on preventive steps even?
>
>
> This is GPFS v4.2.3 on Redhat 7.4, btw...
>
>
> Thanks

[gpfsug-discuss] mmfsd recording High CPU usage

2018-11-20 Thread Saula, Oluwasijibomi
Hello Scalers,


First, let me say Happy Thanksgiving to those of us in the US and to those 
beyond, well, it's a still happy day seeing we're still above ground! 


Now, what I have to discuss isn't anything extreme so don't skip the turkey for 
this, but lately, on a few of our compute GPFS client nodes, we've been 
noticing high CPU usage by the mmfsd process and are wondering why. Here's a 
sample:


[~]# top -b -n 1 | grep mmfs

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND


231898 root   0 -20 14.508g 4.272g  70168 S  93.8  6.8  69503:41 mmfsd

  4161 root   0 -20  121876   9412   1492 S   0.0  0.0   0:00.22 runmmfs


Obviously, this behavior was likely triggered by a not-so-convenient user job 
that in most cases is long finished by the time we investigate. Nevertheless, 
does anyone have an idea why this might be happening? Any thoughts on 
preventive steps even?


This is GPFS v4.2.3 on Redhat 7.4, btw...


Thanks,

Siji Saula
HPC System Administrator
Center for Computationally Assisted Science & Technology
NORTH DAKOTA STATE UNIVERSITY


Research 2 Building – Room 220B
Dept 4100, PO Box 6050  / Fargo, ND 58108-6050
p:701.231.7749
www.ccast.ndsu.edu | 
www.ndsu.edu

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes

2018-03-10 Thread Saula, Oluwasijibomi
Wei -  So the expelled node could ping the rest of the cluster just fine. In 
fact, after adding this new node to the cluster I could traverse the filesystem 
for simple lookups, however, heavy data moves in or out of the filesystem 
seemed to trigger the expel messages to the new node.


This experience prompted my tuning exercise on the node, which has since 
resolved the expel messages to the node even during times of high I/O activity.


Nevertheless, I still have this nagging feeling that the IPoIB tuning for GPFS 
may not be optimal.


To answer your questions, Ed - IB supports both administrative and daemon 
communications, and we have verbsRdma configured.


Currently, we have both 2044 and 65520 MTU nodes on our IB network and I've 
been told this should not be the case. I'm hoping to settle on 4096 MTU nodes 
for the entire cluster but I fear there may be some caveats - any thoughts on 
this?
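
Before converging on 4096, it may help to audit what every node actually runs
today. A sketch, assuming mmdsh works in this cluster and the IPoIB interface
is ib0 everywhere:

    # report MTU and IPoIB mode (datagram vs connected) per node
    mmdsh -N all 'ip -o link show ib0 | grep -o "mtu [0-9]*"; cat /sys/class/net/ib0/mode'

Nodes reporting 65520 are almost certainly in connected mode and the 2044 ones 
in datagram mode, so this may be a mode mismatch as much as an MTU mismatch.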


(Oh, Ed - Hideaki was my mentor for a short while when I began my HPC career 
with NDSU, but he left us shortly after. Maybe like you I can tune up my 
Japanese as well once my GPFS issues are put to rest!)


Thanks,

Siji Saula
HPC System Administrator
Center for Computationally Assisted Science & Technology
NORTH DAKOTA STATE UNIVERSITY


Research 2 Building – Room 220B
Dept 4100, PO Box 6050  / Fargo, ND 58108-6050
p:701.231.7749
www.ccast.ndsu.edu | 
www.ndsu.edu



From: Edward Wahl <ew...@osc.edu>
Sent: Friday, March 9, 2018 8:19:10 AM
To: gpfsug-discuss@spectrumscale.org
Cc: Saula, Oluwasijibomi
Subject: Re: [gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes


Welcome to the list.

If Hideaki Kikuchi is still around CCAST, say "Oh hisashiburi, des ne?" for me.
Though I recall he may have left.


A couple of questions as I, unfortunately, have a good deal of expel experience.

-Are you set up to use verbs or only IPoIB? "mmlsconfig verbsRdma"

-Are you using the IB as the administrative IP network?

-As Wei asked, can nodes sending the expel requests ping the victim over
whatever interface is being used administratively?  Other interfaces do NOT
matter for expels. Nodes that cannot even mount the file systems can still
request expels.  Many many things can cause issues here from routing and
firewalls to bad switch software which will not update ARP tables, and you get
nodes trying to expel each other.

-are your NSDs logging the expels in /tmp/mmfs?  You can mmchconfig
expelDataCollectionDailyLimit if you need more captures to narrow down what is
happening outside the mmfs.log.latest.  Just be wary of the disk space if you
have "expel storms".

-That tuning page is very out of date and appears to be mostly focused on GPFS
3.5.x tuning.   While there is also a Spectrum Scale wiki, its Linux tuning
page does not appear to be kernel and network focused and is dated even older.
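
To extend Ed's checklist, a couple of read-only commands for confirming what
the daemon itself is using, as a sketch:

    mmlsconfig verbsRdma     # is RDMA enabled at all
    mmlsconfig verbsPorts    # which HCA ports GPFS was told to use
    mmdiag --network         # per-connection state as GPFS sees it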


Ed



On Thu, 8 Mar 2018 15:06:03 +0000
"Saula, Oluwasijibomi" <oluwasijibomi.sa...@ndsu.edu> wrote:

> Hi Folks,
>
>
> As this is my first post to the group, let me start by saying I applaud the
> commentary from the user group as it has been a resource to those of us
> watching from the sidelines.
>
>
> That said, we have GPFS layered on IPoIB, and recently we started having
> some issues on our IB FDR fabric, which manifested when GPFS began sending
> persistent expel messages to particular nodes.
>
>
> Shortly after, we embarked on a tuning exercise using IBM tuning
> recommendations<https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/Welcome%20to%20High%20Performance%20Computing%20%28HPC%29%20Central/page/Linux%20System%20Tuning%20Recommendations>
> but this page is quite old and we've run into some snags, specifically with
> setting 4k MTUs using mlx4_core/mlx4_en module options.
>
>
> While setting 4k MTUs as the guide recommends is our general inclination, I'd
> like to solicit some advice as to whether 4k MTUs are a good idea and any
> hitch-free steps to accomplishing this. I'm getting some conflicting remarks
> from Mellanox support asking why we'd want to use 4k MTUs with Unreliable
> Datagram mode.
>
>
> Also, any pointers to best practices or resources for network configurations
> for heavy I/O clusters would be much appreciated.
>
>
> Thanks,
>
> Siji Saula
> HPC System Administrator
> Center for Computationally Assisted Science & Technology
> NORTH DAKOTA STATE UNIVERSITY
>
>
> Research 2 Building – Room 220B

[gpfsug-discuss] Thoughts on GPFS on IB & MTU sizes

2018-03-08 Thread Saula, Oluwasijibomi
Hi Folks,


As this is my first post to the group, let me start by saying I applaud the 
commentary from the user group as it has been a resource to those of us 
watching from the sidelines.


That said, we have GPFS layered on IPoIB, and recently we started having some 
issues on our IB FDR fabric, which manifested when GPFS began sending 
persistent expel messages to particular nodes.


Shortly after, we embarked on a tuning exercise using IBM tuning 
recommendations, but this page is quite old and we've run into some snags, 
specifically with setting 4k MTUs using mlx4_core/mlx4_en module options.


While setting 4k MTUs as the guide recommends is our general inclination, I'd 
like to solicit some advice as to whether 4k MTUs are a good idea and any 
hitch-free steps to accomplishing this. I'm getting some conflicting remarks 
from Mellanox support asking why we'd want to use 4k MTUs with Unreliable 
Datagram mode.


Also, any pointers to best practices or resources for network configurations 
for heavy I/O clusters would be much appreciated.


Thanks,

Siji Saula
HPC System Administrator
Center for Computationally Assisted Science & Technology
NORTH DAKOTA STATE UNIVERSITY


Research 2 Building – Room 220B
Dept 4100, PO Box 6050  / Fargo, ND 58108-6050
p:701.231.7749
www.ccast.ndsu.edu | 
www.ndsu.edu

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss