You know, I ran into this during the upgrade, didn’t have an explanation for 
it, and decided that I would see what happened when it was finished (problem 
was gone). I was going from DSS-G 3.2a to DSS-G 4.3a. It was at the stage where 
I had one machine upgraded and one machine not upgraded, which makes sense. I 
believe I then tried it from the upgraded node, it worked fine, and I figured 
“well, I’m blowing that one away shortly anyway.”

Would one expect to only see this while nodes in the same cluster are not 
running the same version? I guess what I’m asking is, would one expect that 
kind of traffic to come from anywhere other than the same cluster?

Sent from my iPhone

On Mar 21, 2023, at 08:38, Luke Sudbery <[email protected]> wrote:


Thank you, I’ll pass that onto Lenovo. I think I’ve also identified it as this 
in the fixes in 5.1.6.1 too:
https://www.ibm.com/support/pages/apar/IJ44607
IJ44607: SPECTRUM SCALE V5.1.3 AND GREATER FAILS TO INTEROPERATE WITH 
SPECTRUMSCALE VERSIONS PRIOR TO 5.1.3.
Problem summary

*   GNR RPCs fail when received by a GPFS daemon 5.1.3 or later from a GPFS 
    daemon older than version 5.1.3.

*   Kernel assert going off: privVfsP != NULL

Symptom:

*   Hang in the command

As such, AFAICT it will affect all Lenovo customers going from DSSG 3.x (GPFS 
5.1.1.0) to 4.x (GPFS 5.1.5.1 efix20), and I’m a bit annoyed that it wasn’t 
picked up before release.
Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don’t work on Monday.

From: Gang Qiu <[email protected]>
Sent: 21 March 2023 12:31
To: Luke Sudbery (Advanced Research Computing) <[email protected]>; gpfsug 
main discussion list <[email protected]>
Subject: RE: mmvdisk version/communication issues?




Luke,

1. The case number is TS011014198 (Internal defect number is 1163600)
   5.1.3.1 efix47 includes this fix. (I didn't find the efix for 5.1.4 or 5.1.5)
2. The impact is that any GNR-related command from 5.1.2 to 5.1.5 will hang.
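
A rough way to check which nodes fall in the affected range (just a sketch, 
assuming mmdsh is usable from an admin node and all the nodes are reachable):

# Show the running daemon build on every node in the cluster
mmdsh -N all "/usr/lpp/mmfs/bin/mmdiag --version"

# Per the defect above, nodes reporting a build older than 5.1.3 can hang
# when sending GNR RPCs to nodes reporting 5.1.3.0 through 5.1.5.x.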

Regards,
Gang Qiu
====================================================
Gang Qiu(邱钢)
Spectrum Scale Development - ECE
IBM China Systems Lab
Mobile: +86-18612867902
====================================================


From: Luke Sudbery <[email protected]>
Date: Tuesday, 21 March 2023 17:31
To: gpfsug main discussion list <[email protected]>, Gang Qiu 
<[email protected]>
Subject: [EXTERNAL] RE: mmvdisk version/communication issues?
Thank you, that’s far more useful than Lenovo support have been so far!

Unfortunately Lenovo only support particular versions of scale on their DSSG 
and they are our only source of the required GNR packages.

So a couple more questions if you don’t mind!


  *   Can you provide a link to the APAR or similar I can share with Lenovo?
  *   Do you know of any workaround or other impact of this issue? (I think 
I’ve seen mmhealth show false errors because of it – a rough cross-check is 
sketched below.) This may help us to decide whether to just press ahead with 
upgrading to 5.1.5 everywhere, if the issue is not present going 5.1.5 -> 5.1.5.
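
In case it is useful, a rough cross-check for the mmhealth side (only a sketch, 
and it assumes mmhealth’s -N option is available at this code level):

# Compare what a downlevel node and an upgraded node each report; events that
# only appear when querying across the version boundary are more likely side
# effects of the hanging GNR RPCs than real hardware faults.
mmhealth node show -N rds-pg-dssg01,rds-er-dssg01
mmhealth cluster show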

Many thanks,

Luke


--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don’t work on Monday.

From: gpfsug-discuss <[email protected]> On Behalf Of Gang Qiu
Sent: 21 March 2023 03:49
To: gpfsug main discussion list <[email protected]>
Subject: [gpfsug-discuss] RE: mmvdisk version/communication issues?





It is not an mmvdisk issue. mmvdisk hangs because the ‘tslsrecgroup’ command 
it executes hangs.
I recreated it in my environment (5.1.3, 5.1.4 or 5.1.5 coexisting with 5.1.2 
or an earlier version).

After further investigation, it appears to be a known issue which is fixed in 
5.1.6 PTF1.
To resolve this issue, the version needs to be upgraded to at least 5.1.6.1.
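
A minimal sketch for confirming the installed levels after the upgrade (the 
package names are the usual ones on DSS-G/ECE builds, so treat them as an 
assumption for your particular install):

# Installed GPFS/GNR package levels on every node; mmfsd must also be
# restarted on each node before the running daemon is actually at the new level
mmdsh -N all "rpm -q gpfs.base gpfs.gnr"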


Regards,
Gang Qiu
====================================================
Gang Qiu(邱钢)
Spectrum Scale Development - ECE
IBM China Systems Lab
Mobile: +86-18612867902
====================================================


From: gpfsug-discuss <[email protected]> On Behalf Of 
Luke Sudbery <[email protected]>
Date: Saturday, 18 March 2023 00:06
To: gpfsug main discussion list <[email protected]>
Subject: [EXTERNAL] Re: [gpfsug-discuss] mmvdisk version/communication issues?
On further investigation the command does eventually complete, after 11 minutes 
rather than a couple of seconds.


[root@rds-pg-dssg01 ~]# time mmvdisk pdisk list --rg rds_er_dssg02 --not-ok

mmvdisk: All pdisks of recovery group 'rds_er_dssg02' are ok.



real    11m14.106s

user    0m1.430s

sys     0m0.555s

[root@rds-pg-dssg01 ~]#

Looking at the process tree, the bits that hang are:

[root@rds-pg-dssg01 ~]# time tslsrecgroup rds_er_dssg02 -Y --v2 --failure-domain

Failed to connect to file system daemon: Connection timed out



real    5m30.181s

user    0m0.001s

sys     0m0.003s

and then

[root@rds-pg-dssg01 ~]# time tslspdisk --recovery-group rds_er_dssg02 --notOK

Failed to connect to file system daemon: Connection timed out



real    5m30.247s

user    0m0.003s

sys     0m0.002s

[root@rds-pg-dssg01 ~]#

Which adds up to the 11 minutes... then it does something else and just works. 
Or maybe it doesn’t work and just wouldn’t report any failed disks if there 
were any…
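
One way to sanity-check that, sketched here on the assumption that the upgraded 
pair answers for its own recovery group without hitting the slow path:

# Ask the upgraded server that actually serves the RG, where the command
# returns in a couple of seconds, and compare with the 11-minute run above.
ssh rds-er-dssg01 /usr/lpp/mmfs/bin/mmvdisk pdisk list --rg rds_er_dssg02 --not-ok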

While hanging, the ts commands appear to be LISTENing, not attempting to make 
connections:


[root@rds-pg-dssg01 ~]# pidof tslspdisk

2156809

[root@rds-pg-dssg01 ~]# netstat -apt | grep 2156809

tcp        0      0 0.0.0.0:60000           0.0.0.0:*               LISTEN      
2156809/tslspdisk

[root@rds-pg-dssg01 ~]#

Port 60000 is the lowest of our tscCmdPortRange.
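
For anyone comparing notes, a couple of quick checks (a sketch only; the nc 
direction is inferred from the LISTEN above, i.e. the remote side apparently 
needs to reach back to this port):

# Configured command port range on this cluster
mmlsconfig tscCmdPortRange

# Check the lowest port in the range is reachable from the node serving the
# recovery group back to the hanging node (assumes nc is installed on both)
ssh rds-er-dssg02 nc -zv rds-pg-dssg01 60000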

Don’t know if that helps anyone….

Cheers,

Luke
--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don’t work on Monday.

From: gpfsug-discuss <[email protected]> On Behalf Of 
Luke Sudbery
Sent: 17 March 2023 15:11
To: [email protected]
Subject: [gpfsug-discuss] mmvdisk version/communication issues?

Hello,

We have 3 Lenovo DSSG “Building Blocks” as they call them – 2x GNR server pairs.

We’ve just upgraded the 1st of them from 3.2a (GPFS 5.1.1.0) to 4.3a (5.1.5.1 
efix 20).

Now the older systems can’t communicate with the newer ones in certain 
circumstances, specifically when querying recovery groups hosted on other servers.

It works old->old, new->old and new->new but not old->new.

I’m fairly sure it is not a TCP comms problem. I can ssh between the nodes as 
root and as the GPFS sudoUser. Port 1191 and the tscCmdPortRange are open and 
accessible in both directions between the nodes. There are connections present 
between the nodes in netstat and in mmfsd.latest.log. No pending messages (to 
that node) in mmdiag --network.
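
For completeness, roughly what such checks might look like (a sketch, with nc 
used purely as an illustration):

# Daemon port reachable old -> new and new -> old
nc -zv rds-er-dssg02 1191
ssh rds-er-dssg02 nc -zv rds-pg-dssg01 1191

# And the daemon's own view of its connections, including any pending messages
mmdiag --network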

In these examples rds-er-dssg01/2 are upgraded, rds-pg-dssg01/2 are downlevel:


[root@rds-er-dssg01 ~]# mmvdisk pdisk list --rg rds_er_dssg02 --not-ok  # New 
to new

mmvdisk: All pdisks of recovery group 'rds_er_dssg02' are ok.

[root@rds-er-dssg01 ~]# mmvdisk pdisk list --rg rds_pg_dssg02 --not-ok  # New 
to old

mmvdisk: All pdisks of recovery group 'rds_pg_dssg02' are ok.

[root@rds-er-dssg01 ~]#


[root@rds-pg-dssg01 ~]# mmvdisk pdisk list --rg rds_pg_dssg02 --not-ok # Old to 
old

mmvdisk: All pdisks of recovery group 'rds_pg_dssg02' are ok.

[root@rds-pg-dssg01 ~]# mmvdisk pdisk list --rg rds_er_dssg02 --not-ok # Old to 
new [HANGS]

^Cmmvdisk: Command failed. Examine previous error messages to determine cause.

[root@rds-pg-dssg01 ~]#

Has anyone come across this? mmvdisk should work across slightly different 
versions of 5.1, right? No recovery group, cluster or filesystem versions have 
been changed yet.
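
For reference, a quick way to confirm that (only a sketch; the recovery group 
feature level can be checked too, but it is left out here):

# Cluster release level is still the pre-upgrade value
mmlsconfig minReleaseLevel

# File system format version(s) have not been raised
mmlsfs all -V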

We will also log a ticket with snaps and more info but wondered if anyone had 
seen this.

And while this particular command is not a major issue, we don’t know what else 
it may affect before we proceed with the rest of the cluster.

Many thanks,

Luke

--
Luke Sudbery
Principal Engineer (HPC and Storage).
Architecture, Infrastructure and Systems
Advanced Research Computing, IT Services
Room 132, Computer Centre G5, Elms Road

Please note I don’t work on Monday.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
