I thought I'd share this with folks: I've been seeing log asserts in our test environment (~1050 client nodes and 12 manager/server nodes) while upgrading from 3.5.0.31 (well, 2 clients are still at 3.5.0.19) to 4.1.1.10. I've been running filebench in a loop for the past several days; it's sustaining about 60k write IOPS and about 15k read IOPS to the metadata disks of the filesystem I'm testing with, so I'd say it's getting pushed reasonably hard.

The test cluster had 4.1 clients before it had 4.1 servers, but after flipping 420 clients from 3.5.0.31 to 4.1.1.10 and starting up filebench, I'm now seeing periodic log asserts from the manager/server nodes:

Dec 11 08:57:39 loremds12 mmfs: Generic error in /project/sprelfks2/build/rfks2s010a/src/avs/fs/mmfs/ts/tm/HandleReq.C line 304 retCode 0, reasonCode 0
Dec 11 08:57:39 loremds12 mmfs: mmfsd: Error=MMFS_GENERIC, ID=0x30D9195E, Tag=4908715
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715 (!"downgrade to mode which is not StrictlyWeaker")
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715 node 584 old mode ro new mode (A: D: A)
Dec 11 08:57:39 loremds12 mmfs: [X] logAssertFailed: (!"downgrade to mode which is not StrictlyWeaker")
Dec 11 08:57:39 loremds12 mmfs: [X] return code 0, reason code 0, log record tag 0
Dec 11 08:57:42 loremds12 mmfs: [N] Signal 6 at location 0x7FF4E5F456D5 in process 12188, link reg 0xFFFFFFFFFFFFFFFF.
Dec 11 08:57:42 loremds12 mmfs: [X] *** Assert exp((!"downgrade to mode which is not StrictlyWeaker") node 584 old mode ro new mode (A: D: A) ) in line 304 of file /project/sprelfks2/build/rfks2s010a/src/avs/fs/mmfs/ts/tm/HandleReq.C
Dec 11 08:57:42 loremds12 mmfs: [E] *** Traceback:
Dec 11 08:57:42 loremds12 mmfs: [E] 2:0x9F95E9 logAssertFailed.9F9440 + 0x1A9 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 3:0x1232836 TokenClass::fixClientMode(Token*, int, int, int, CopysetRevoke*).1232350 + 0x4E6 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 4:0x1235593 TokenClass::HandleTellRequest(RpcContext*, Request*, char**, int).1232AD0 + 0x2AC3 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 5:0x123A23C HandleTellRequestInterface(RpcContext*, Request*, char**, int).123A0D0 + 0x16C at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 6:0x125C6B0 queuedTellServer(RpcContext*, Request*, int, unsigned int).125C670 + 0x40 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 7:0x125EF72 tmHandleTellServer(RpcContext*, char*).125EEC0 + 0xB2 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 8:0xA12668 tscHandleMsg(RpcContext*, MsgDataBuf*).A120D0 + 0x598 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 9:0xA1BC4E RcvWorker::RcvMain().A1BB50 + 0xFE at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 10:0xA1BD5B RcvWorker::thread(void*).A1BD00 + 0x5B at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 11:0x622126 Thread::callBody(Thread*).6220E0 + 0x46 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 12:0x61220F Thread::callBodyWrapper(Thread*).612180 + 0x8F at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 13:0x7FF4E6BE66B6 start_thread + 0xE6 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 14:0x7FF4E5FEE06D clone + 0x6D at ??:0

I've seen several variations of that third "Tag=" line; the node index and the new mode differ from assert to assert:

Dec 11 00:16:40 loremds11 mmfs: Tag=5012168 node 825 old mode ro new mode 0x31
Dec 11 01:52:53 loremds10 mmfs: Tag=5016618 node 655 old mode ro new mode (A: MA D: )
Dec 11 02:15:57 loremds10 mmfs: Tag=5045549 node 994 old mode ro new mode (A: A D: A)
Dec 11 08:14:22 loremds10 mmfs: Tag=5067054 node 237 old mode ro new mode 0x08
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715 node 584 old mode ro new mode (A: D: A)
Dec 11 00:47:39 loremds09 mmfs: Tag=4998635 node 461 old mode ro new mode (A:R D: )
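For anyone wanting to tally these up across a larger syslog haul, the node index can be pulled out of each "old mode" line with a one-liner; a minimal sketch over two of the sample lines above (the log format is assumed to be exactly as quoted):

```shell
# Print the GPFS node index from each "old mode" assert line.
# The field after the literal word "node" is the node index.
awk '/old mode/ { for (i = 1; i <= NF; i++) if ($i == "node") print $(i+1) }' <<'EOF'
Dec 11 00:16:40 loremds11 mmfs: Tag=5012168 node 825 old mode ro new mode 0x31
Dec 11 08:14:22 loremds10 mmfs: Tag=5067054 node 237 old mode ro new mode 0x08
EOF
```

Piping the result through `sort -n | uniq -c` gives a quick per-node assert count.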

It's interesting to note that all of these node indexes belong to nodes still running 3.5. I'm going to open a PMR, but I thought I'd share the gory details here and see if anyone has insight. I'm starting to wonder whether 4.1 clients are more tolerant of 3.5 servers than 4.1 servers are of 3.5 clients.
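To check that "all asserting nodes are still on 3.5" observation systematically rather than by eyeball, the node indexes from the assert lines can be cross-referenced against a per-node daemon-version listing. A minimal sketch; the version map here is a stubbed-out assumption (in practice it would be collected from the cluster, e.g. by running mmdiag --version on each node):

```python
import re

# Node/mode lines quoted from the asserts above.
assert_lines = [
    "Tag=5012168 node 825 old mode ro new mode 0x31",
    "Tag=5067054 node 237 old mode ro new mode 0x08",
    "Tag=4908715 node 584 old mode ro new mode (A: D: A)",
]

# Hypothetical node-index -> daemon-version map; values are placeholders,
# not real output from this cluster.
daemon_version = {825: "3.5.0.31", 237: "3.5.0.31", 584: "3.5.0.19"}

asserting = {int(m.group(1)) for line in assert_lines
             if (m := re.search(r"\bnode (\d+)\b", line))}
still_on_35 = sorted(n for n in asserting
                     if daemon_version.get(n, "").startswith("3.5"))
print(still_on_35)  # node indexes that asserted and still run a 3.5 daemon
```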

-Aaron

On 12/5/16 4:31 PM, Aaron Knister wrote:
Hi Everyone,

In the GPFS documentation
(http://www.ibm.com/support/knowledgecenter/SSFKCN_4.1.0/com.ibm.cluster.gpfs.v4r1.gpfs300.doc/bl1ins_migratl.htm)
it has this to say about the duration of an upgrade from 3.5 to 4.1:

Rolling upgrades allow you to install new GPFS code one node at a time
without shutting down GPFS
on other nodes. However, you must upgrade all nodes within a short
time. The time dependency exists
because some GPFS 4.1 features become available on each node as soon as
the node is upgraded, while
other features will not become available until you upgrade all
participating nodes.

Does anyone have a feel for what "a short time" means? I'm looking to
upgrade from 3.5.0.31 to 4.1.1.10 in a rolling fashion but given the
size of our system it might take several weeks to complete. Seeing this
language concerns me that after some period of time something bad is
going to happen, but I don't know what that period of time is.

Also, if anyone has done a rolling 3.5 to 4.1 upgrade and has any
anecdotes they'd like to share, I would like to hear them.

Thanks!

-Aaron


--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
