Re: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

2019-01-30 Thread Dwayne.Hart
Could you get away with running “mmdiag --stats” and inspecting the uptime 
information it provides?
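
For example, a rough sketch (the exact label of the uptime figure in the mmdiag 
output varies by release, so the grep pattern below is an assumption):

   # compare how long mmfsd has been up against how long the host has been up;
   # if mmfsd is much younger than the host, GPFS was likely restarted
   mmdiag --stats | grep -i uptime
   uptime -p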

Best,
Dwayne
—
Dwayne Hart | Systems Administrator IV

CHIA, Faculty of Medicine
Memorial University of Newfoundland
300 Prince Philip Drive
St. John’s, Newfoundland | A1B 3V6
Craig L Dobbin Building | 4M409
T 709 864 6631

On Jan 30, 2019, at 5:32 PM, Sanchez, Paul <paul.sanc...@deshaw.com> wrote:

There are some cases which I don’t believe can be caught with callbacks (e.g. 
DMS = Dead Man Switch).  But you could possibly use preStartup to check the 
host uptime and infer whether GPFS was restarted long after the host booted.  
You could also peek in /tmp/mmfs and only report if you find something there.  
That said, the docs say that preStartup fires after the node joins the 
cluster, so if that means once the node is ‘active’, then you might miss nodes 
stuck in ‘arbitrating’ for a while due to a waiter problem.
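
A minimal sketch of that approach (the script path, the 600-second threshold 
and the callback name are all made up for illustration, and you would want to 
confirm exactly when preStartup fires in your release):

   #!/bin/bash
   # /usr/local/sbin/gpfs-prestartup-check.sh (hypothetical)
   # Flag a likely crash/restart: GPFS starting long after boot,
   # or anything left behind in /tmp/mmfs.
   uptime_s=$(awk '{print int($1)}' /proc/uptime)
   if [ "$uptime_s" -gt 600 ] || [ -n "$(ls -A /tmp/mmfs 2>/dev/null)" ]; then
       logger -t gpfs-callback "possible mmfsd crash/restart on $(hostname)"
   fi

registered once with something like:

   mmaddcallback crashCheck --command /usr/local/sbin/gpfs-prestartup-check.sh --event preStartup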

We run a script from cron which monitors the myriad things that can go wrong, 
attempts to right those which are safe to fix, and raises alerts 
appropriately.  Something like that, outside the reach of GPFS, is often a good 
choice if you don’t need to know about something the moment it happens.
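
The scheduling side of that is just an ordinary cron entry, e.g. (script name 
and interval purely illustrative):

   # /etc/cron.d/gpfs-health
   */5 * * * * root /usr/local/sbin/gpfs-health-check.sh 2>&1 | logger -t gpfs-health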

Thx
Paul

From: gpfsug-discuss-boun...@spectrumscale.org
<gpfsug-discuss-boun...@spectrumscale.org> On Behalf Of Oesterlin, Robert
Sent: Wednesday, January 30, 2019 3:52 PM
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Subject: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

Anyone crafted a good way to detect a node ‘crash and restart’ event using GPFS 
callbacks? I’m thinking “preShutdown”, but I’m not sure that’s the best fit. What 
I’m really looking for is whether the node shut down (aborted) and created a dump 
in /tmp/mmfs.


Bob Oesterlin
Sr Principal Storage Engineer, Nuance

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Hanging file-systems

2018-11-27 Thread Dwayne.Hart
Hi Simon,

Was there a reason behind swap being disabled?

Best,
Dwayne
—
Dwayne Hart | Systems Administrator IV

CHIA, Faculty of Medicine
Memorial University of Newfoundland
300 Prince Philip Drive
St. John’s, Newfoundland | A1B 3V6
Craig L Dobbin Building | 4M409
T 709 864 6631

On Nov 27, 2018, at 2:24 PM, Simon Thompson <s.j.thomp...@bham.ac.uk> wrote:

Thanks Sven …

We found a node with kswapd running 100% (and swap was off)…

Killing that node made access to the FS spring into life.
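
For anyone chasing the same symptom, a rough way to spot such a node across the 
cluster (assuming clush or similar parallel-ssh tooling; the @gpfsnodes group 
name is invented):

   # swap state plus kswapd CPU burn on every node
   clush -w @gpfsnodes 'swapon --show; ps -o pcpu=,comm= -C kswapd0'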

Simon

From: <gpfsug-discuss-boun...@spectrumscale.org> on behalf of
"oeh...@gmail.com" <oeh...@gmail.com>
Reply-To: "gpfsug-discuss@spectrumscale.org" <gpfsug-discuss@spectrumscale.org>
Date: Tuesday, 27 November 2018 at 16:14
To: "gpfsug-discuss@spectrumscale.org" <gpfsug-discuss@spectrumscale.org>
Subject: Re: [gpfsug-discuss] Hanging file-systems

1. Are you under memory pressure, or even worse, have you started swapping?
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Best way to migrate data

2018-10-19 Thread Dwayne.Hart
Thank you, Ryan. I’ll have a more in-depth look at this application later today 
and see how it deals with some of the large genetic files that are generated by 
the sequencer, by copying them from one GPFS file system to another.

Best,
Dwayne
—
Dwayne Hart | Systems Administrator IV

CHIA, Faculty of Medicine 
Memorial University of Newfoundland 
300 Prince Philip Drive
St. John’s, Newfoundland | A1B 3V6
Craig L Dobbin Building | 4M409
T 709 864 6631

> On Oct 19, 2018, at 7:04 AM, Ryan Novosielski  wrote:
> 
> We use parsyncfp. Our target is not GPFS, though. I was really hoping
> to hear about something snazzier for GPFS-GPFS. Lenovo would probably
> tell you that HSM is the way to go (we asked something similar for a
> replacement for our current setup or for distributed storage).
> 
>> On 10/18/2018 01:19 PM, dwayne.h...@med.mun.ca wrote:
>> Hi,
>> 
>> Just wondering what the best recipe is for migrating a user’s home
>> directory content from one GPFS file system to another, which hosts
>> a larger research GPFS file system? I’m currently using rsync and
>> it has maxed out the client system’s IB interface.
>> 
>> Best, Dwayne — Dwayne Hart | Systems Administrator IV
>> 
>> CHIA, Faculty of Medicine Memorial University of Newfoundland 300
>> Prince Philip Drive St. John’s, Newfoundland | A1B 3V6 Craig L
>> Dobbin Building | 4M409 T 709 864 6631 
>> ___ gpfsug-discuss
>> mailing list gpfsug-discuss at spectrumscale.org 
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>> 
> 
> -- 
> 
> || \\UTGERS, |--*O*
> ||_// the State  |Ryan Novosielski - novos...@rutgers.edu
> || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
> ||  \\of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
>  `'
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Best way to migrate data

2018-10-19 Thread Dwayne.Hart
Hi JAB,

We do not have either ILM or HSM. Thankfully, we do at least have IBM Spectrum 
Protect (I recently upgraded the system to version 8.1.5). 

It would be an interesting exercise to see how long it would take IBM SP to 
restore a user's content in full to a different target. I have done some smaller 
recoveries, so I know that the system is in a usable state ;)

Best,
Dwayne

-Original Message-
From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Jonathan Buzzard
Sent: Friday, October 19, 2018 6:39 AM
To: gpfsug-discuss@spectrumscale.org
Subject: Re: [gpfsug-discuss] Best way to migrate data

On 18/10/2018 18:19, dwayne.h...@med.mun.ca wrote:
> Hi,
> 
> Just wondering what the best recipe is for migrating a user’s home 
> directory content from one GPFS file system to another, which hosts a 
> larger research GPFS file system? I’m currently using rsync and it has 
> maxed out the client system’s IB interface.
> 

Be careful with rsync: it resets all your atimes, which screws up any hope of 
doing ILM or HSM.

My personal favourite is to do something along the lines of

   dsmc restore /gpfs/

Minimal impact on the user-facing services, and it seems to preserve atimes, last 
time I checked. Sure, it tanks your backup server a bit, but that is not user 
facing. What do users care if the backup takes longer than normal?

Of course this presumes you have a backup :-)
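
Spelled out a little more for this case (flags from memory of the BA client, so 
check dsmc help restore before leaning on them; paths are illustrative):

   # restore one user's tree from Spectrum Protect into the new file system
   dsmc restore "/gpfs/home/user/*" /research/project/user/ -subdir=yes -replace=no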


JAB.

-- 
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG 
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Best way to migrate data

2018-10-18 Thread Dwayne.Hart
Thank you all for the responses. I'm currently using msrsync and things appear 
to be going very well.

The data transfer is contained inside our DC. I'm transferring a user's home 
directory content from one GPFS file system to another. Our IBM Spectrum Scale 
solution consists of 12 I/O nodes connected to IB, and the client node that I'm 
using to transfer the data from one fs to the other is also connected to IB, 
with a possible maximum of 2 hops. 

[root@client-system]# /gpfs/home/dwayne/bin/msrsync -P --stats -p 32 
/gpfs/home/user/ /research/project/user/
[64756/992397 entries] [30.1 T/239.6 T transferred] [81 entries/s] [39.0 G/s 
bw] [monq 0] [jq 62043]

Best,
Dwayne

-Original Message-
From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of Christopher Black
Sent: Thursday, October 18, 2018 4:43 PM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] Best way to migrate data

Other tools and approaches that we've found helpful:

msrsync: handles parallelizing rsync within a dir tree and can greatly speed up 
transfers on a single node with both filesystems mounted, especially when 
dealing with many small files.

Globus/GridFTP: set up one or more endpoints on each side; GridFTP will 
auto-parallelize and recover from disruptions.

msrsync is easier to get going but is limited to one parent dir per node. We've 
sometimes done an additional level of parallelization by running msrsync with 
different top-level directories on different HPC nodes simultaneously.
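
As a sketch of that extra level of parallelism (host names, paths and the 
process count are invented for illustration):

   # one msrsync per top-level directory, each launched on a different HPC node
   # (assumes no spaces in directory names)
   nodes=(hpc01 hpc02 hpc03 hpc04)
   i=0
   for d in /gpfs/home/*/; do
       node=${nodes[i % ${#nodes[@]}]}; i=$((i+1))
       ssh "$node" "msrsync -P -p 16 $d /research/project/$(basename $d)/" &
   done
   wait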

Best,
Chris

Refs:
https://github.com/jbd/msrsync
https://www.globus.org/

On 10/18/18, 2:54 PM, "gpfsug-discuss-boun...@spectrumscale.org on behalf of 
Sanchez, Paul"  wrote:

Sharding can also work, if you have a storage-connected compute grid in 
your environment:  if you enumerate all of the directories and then use a 
non-recursive rsync for each one, you may be able to parallelize the workload 
by using several clients simultaneously.  It may still max out the links of 
these clients (assuming your source read throughput and target write throughput 
bottlenecks aren't encountered first), but it may run that way in 1/100th of 
the time if you can use 100+ machines.
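
One way to express that (a sketch assuming GNU parallel, a nodes.txt host list, 
and both filesystems mounted on every client; all names are illustrative):

   # 1) replicate the directory skeleton only
   rsync -a -f '+ */' -f '- *' /gpfs/home/user/ /research/project/user/
   # 2) enumerate dirs, then run one non-recursive rsync per dir
   #    (add --sshloginfile nodes.txt to spread the jobs over several clients)
   cd /gpfs/home/user && find . -type d | \
       parallel -j 16 "rsync -a --exclude='*/' /gpfs/home/user/{}/ /research/project/user/{}/"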

-Paul
-Original Message-
From: gpfsug-discuss-boun...@spectrumscale.org 
 On Behalf Of Buterbaugh, Kevin L
Sent: Thursday, October 18, 2018 2:26 PM
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] Best way to migrate data

Hi Dwayne,

I’m assuming you can’t just let an rsync run, possibly throttled in some 
way?  If not, and if you’re just tapping out your network, then would it be 
possible to go old school?  We have parts of the Medical Center here where 
their network connections are … um, less than robust.  So they tar stuff up to 
a portable HD, sneaker-net it to us, and we untar it from an NSD server.
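
If it came to that, the old-school route is roughly (paths purely illustrative):

   # on the client: stage the user's tree onto the portable drive
   tar -cf /mnt/usbdrive/user.tar -C /gpfs/home user
   # on the NSD server: unpack into the target file system
   tar -xf /mnt/usbdrive/user.tar -C /research/project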

HTH, and I really hope that someone has a better idea than that!

Kevin

> On Oct 18, 2018, at 12:19 PM, dwayne.h...@med.mun.ca wrote:
>
> Hi,
>
> Just wondering what the best recipe is for migrating a user’s home directory 
content from one GPFS file system to another, which hosts a larger research GPFS 
file system? I’m currently using rsync and it has maxed out the client system’s 
IB interface.
>
> Best,
> Dwayne
> —
> Dwayne Hart | Systems Administrator IV
>
> CHIA, Faculty of Medicine
> Memorial University of Newfoundland
> 300 Prince Philip Drive
> St. John’s, Newfoundland | A1B 3V6
> Craig L Dobbin Building | 4M409
> T 709 864 6631
> ___
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Best way to migrate data

2018-10-18 Thread Dwayne.Hart
Hi,

Just wondering what the best recipe is for migrating a user’s home directory 
content from one GPFS file system to another, which hosts a larger research GPFS 
file system? I’m currently using rsync and it has maxed out the client system’s 
IB interface.

Best,
Dwayne 
—
Dwayne Hart | Systems Administrator IV

CHIA, Faculty of Medicine 
Memorial University of Newfoundland 
300 Prince Philip Drive
St. John’s, Newfoundland | A1B 3V6
Craig L Dobbin Building | 4M409
T 709 864 6631
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2

2018-05-22 Thread Dwayne.Hart
We are having issues with ESS/Mellanox implementation and were curious as to 
what you were working with from a network perspective.

Best,
Dwayne
—
Dwayne Hart | Systems Administrator IV

CHIA, Faculty of Medicine
Memorial University of Newfoundland
300 Prince Philip Drive
St. John’s, Newfoundland | A1B 3V6
Craig L Dobbin Building | 4M409
T 709 864 6631

On May 22, 2018, at 2:10 PM, "vall...@cbio.mskcc.org" <vall...@cbio.mskcc.org> wrote:

10G Ethernet.

Thanks,
Lohit

On May 22, 2018, 11:55 AM -0400, dwayne.h...@med.mun.ca wrote:
Hi Lohit,

What type of network are you using on the back end to transfer the GPFS traffic?

Best,
Dwayne

From: gpfsug-discuss-boun...@spectrumscale.org
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of
vall...@cbio.mskcc.org
Sent: Tuesday, May 22, 2018 1:13 PM
To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from 
GPFS 5.0.0-2 to GPFS 4.2.3.2

Hello All,

We upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have not 
yet converted the file system from version 4.2.2.2 to 5 (that is, we have not 
run the mmchconfig release=LATEST command).
Right after the upgrade, we started seeing many “ps hangs” across the cluster. 
All of the “ps hangs” happen when jobs related to a Java process or many Java 
threads are running (example: GATK).
The hangs are pretty random and have no particular pattern, except that we know 
they are related to Java or to jobs reading from directories with about 60 
files.

I raised an IBM critical service request about a month ago related to this 
- PMR: 24090,L6Q,000.
However, according to the ticket, they seem to feel that it might not be 
related to GPFS, although we are sure that these hangs started to appear only 
after we upgraded from GPFS 4.2.3.2 to 5.0.0-2.

One of the other reasons we are not able to prove that it is GPFS is that we 
are unable to capture any logs/traces from GPFS once the hang happens.
Even the GPFS trace commands hang once “ps” hangs, so it is getting difficult 
to get any dumps from GPFS.

Also, according to the IBM ticket, they seem to have seen a “ps hang” issue 
before, and running the mmchconfig release=LATEST command should resolve it.
However, we are not comfortable making the permanent change to file system 
version 5, and since we don’t see any imminent solution to these hangs, we are 
thinking of downgrading to GPFS 4.2.3.2, the previous state in which we know 
the cluster was stable.

Can downgrading GPFS take us back to exactly the previous GPFS config state?
With respect to downgrading from 5 to 4.2.3.2: is it just a matter of 
reinstalling all rpms at the previous version, or is there anything else I need 
to take care of with respect to the GPFS configuration?
I ask because I think GPFS 5.0 may have updated internal default GPFS 
configuration parameters, and I am not sure whether downgrading GPFS will 
change them back to what they were in GPFS 4.2.3.2.
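
Whatever route we take, it is probably worth capturing the current 
configuration first so that the before and after states can be compared, e.g. 
(output paths are illustrative):

   mmlsconfig > /root/mmlsconfig.before-downgrade
   mmlsfs all  > /root/mmlsfs.before-downgrade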

Our previous state:

2 Storage clusters - 4.2.3.2
1 Compute cluster - 4.2.3.2  ( remote mounts the above 2 storage clusters )

Our current state:

2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2)
1 Compute cluster - 5.0.0.2

Do I need to downgrade all the clusters to go back to the previous state, or is 
it OK if we just downgrade the compute cluster to the previous version?

Any advice on the best steps forward, would greatly help.

Thanks,

Lohit
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2

2018-05-22 Thread Dwayne.Hart
Hi Lohit,

What type of network are you using on the back end to transfer the GPFS traffic?

Best,
Dwayne

From: gpfsug-discuss-boun...@spectrumscale.org 
[mailto:gpfsug-discuss-boun...@spectrumscale.org] On Behalf Of 
vall...@cbio.mskcc.org
Sent: Tuesday, May 22, 2018 1:13 PM
To: gpfsug main discussion list 
Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from 
GPFS 5.0.0-2 to GPFS 4.2.3.2

Hello All,

We upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have not 
yet converted the file system from version 4.2.2.2 to 5 (that is, we have not 
run the mmchconfig release=LATEST command).
Right after the upgrade, we started seeing many “ps hangs” across the cluster. 
All of the “ps hangs” happen when jobs related to a Java process or many Java 
threads are running (example: GATK).
The hangs are pretty random and have no particular pattern, except that we know 
they are related to Java or to jobs reading from directories with about 60 
files.

I raised an IBM critical service request about a month ago related to this 
- PMR: 24090,L6Q,000.
However, according to the ticket, they seem to feel that it might not be 
related to GPFS, although we are sure that these hangs started to appear only 
after we upgraded from GPFS 4.2.3.2 to 5.0.0-2.

One of the other reasons we are not able to prove that it is GPFS is that we 
are unable to capture any logs/traces from GPFS once the hang happens.
Even the GPFS trace commands hang once “ps” hangs, so it is getting difficult 
to get any dumps from GPFS.

Also, according to the IBM ticket, they seem to have seen a “ps hang” issue 
before, and running the mmchconfig release=LATEST command should resolve it.
However, we are not comfortable making the permanent change to file system 
version 5, and since we don’t see any imminent solution to these hangs, we are 
thinking of downgrading to GPFS 4.2.3.2, the previous state in which we know 
the cluster was stable.

Can downgrading GPFS take us back to exactly the previous GPFS config state?
With respect to downgrading from 5 to 4.2.3.2: is it just a matter of 
reinstalling all rpms at the previous version, or is there anything else I need 
to take care of with respect to the GPFS configuration?
I ask because I think GPFS 5.0 may have updated internal default GPFS 
configuration parameters, and I am not sure whether downgrading GPFS will 
change them back to what they were in GPFS 4.2.3.2.

Our previous state:

2 Storage clusters - 4.2.3.2
1 Compute cluster - 4.2.3.2  ( remote mounts the above 2 storage clusters )

Our current state:

2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2)
1 Compute cluster - 5.0.0.2

Do I need to downgrade all the clusters to go back to the previous state, or is 
it OK if we just downgrade the compute cluster to the previous version?

Any advice on the best steps forward, would greatly help.

Thanks,

Lohit
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] Introduction to the "gpfsug-discuss" mailing list

2018-03-28 Thread Dwayne.Hart
Hi,

My name is Dwayne Hart. I currently work for the Center for Health Informatics 
& Analytics (CHIA), Faculty of Medicine at Memorial University of Newfoundland, 
as a Systems/Network Security Administrator. In this role I am responsible for 
several HPC (Intel and Power) instances, an OpenStack cloud environment, and 
research data. We leverage IBM Spectrum Scale as our primary storage solution. 
I have been working with GPFS since 2015.

Best,
Dwayne
---
Systems Administrator
Center for Health Informatics & Analytics (CHIA)
Craig L. Dobbin Center for Genetics
Room 4M409
300 Prince Philip Dr.
St. John’s, NL Canada
A1B 3V6
Tel:  (709) 864-6631
E Mail:  dwayne.h...@med.mun.ca
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss