I thought I'd share this with folks: I've been seeing log asserts in our test environment (~1050 client nodes and 12 manager/server nodes) while upgrading from 3.5.0.31 (well, 2 clients are still at 3.5.0.19) to 4.1.1.10. I've been running filebench in a loop for the past several days; it's sustaining about 60k write IOPS and about 15k read IOPS to the metadata disks of the filesystem I'm testing with, so I'd say it's getting pushed reasonably hard.

The test cluster had 4.1 clients before it had 4.1 servers, but after flipping 420 clients from 3.5.0.31 to 4.1.1.10 and starting up filebench, I'm now seeing periodic log asserts from the manager/server nodes:

Dec 11 08:57:39 loremds12 mmfs: Generic error in /project/sprelfks2/build/rfks2s010a/src/avs/fs/mmfs/ts/tm/HandleReq.C line 304 retCode 0, reasonCode 0
Dec 11 08:57:39 loremds12 mmfs: mmfsd: Error=MMFS_GENERIC, ID=0x30D9195E, Tag=4908715
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715 (!"downgrade to mode which is not StrictlyWeaker")
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715 node 584 old mode ro new mode (A: D: A)
Dec 11 08:57:39 loremds12 mmfs: [X] logAssertFailed: (!"downgrade to mode which is not StrictlyWeaker")
Dec 11 08:57:39 loremds12 mmfs: [X] return code 0, reason code 0, log record tag 0
Dec 11 08:57:42 loremds12 mmfs: [N] Signal 6 at location 0x7FF4E5F456D5 in process 12188, link reg 0xFFFFFFFFFFFFFFFF.
Dec 11 08:57:42 loremds12 mmfs: [X] *** Assert exp((!"downgrade to mode which is not StrictlyWeaker") node 584 old mode ro new mode (A: D: A) ) in line 304 of file /project/sprelfks2/build/rfks2s010a/src/avs/fs/mmfs/ts/tm/HandleReq.C
Dec 11 08:57:42 loremds12 mmfs: [E] *** Traceback:
Dec 11 08:57:42 loremds12 mmfs: [E] 2:0x9F95E9 logAssertFailed.9F9440 + 0x1A9 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 3:0x1232836 TokenClass::fixClientMode(Token*, int, int, int, CopysetRevoke*).1232350 + 0x4E6 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 4:0x1235593 TokenClass::HandleTellRequest(RpcContext*, Request*, char**, int).1232AD0 + 0x2AC3 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 5:0x123A23C HandleTellRequestInterface(RpcContext*, Request*, char**, int).123A0D0 + 0x16C at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 6:0x125C6B0 queuedTellServer(RpcContext*, Request*, int, unsigned int).125C670 + 0x40 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 7:0x125EF72 tmHandleTellServer(RpcContext*, char*).125EEC0 + 0xB2 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 8:0xA12668 tscHandleMsg(RpcContext*, MsgDataBuf*).A120D0 + 0x598 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 9:0xA1BC4E RcvWorker::RcvMain().A1BB50 + 0xFE at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 10:0xA1BD5B RcvWorker::thread(void*).A1BD00 + 0x5B at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 11:0x622126 Thread::callBody(Thread*).6220E0 + 0x46 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 12:0x61220F Thread::callBodyWrapper(Thread*).612180 + 0x8F at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 13:0x7FF4E6BE66B6 start_thread + 0xE6 at ??:0
Dec 11 08:57:42 loremds12 mmfs: [E] 14:0x7FF4E5FEE06D clone + 0x6D at ??:0

I've seen several variations of that third "Tag=" line; the node index and the new mode differ from assert to assert:

Dec 11 00:16:40 loremds11 mmfs: Tag=5012168 node 825 old mode ro new mode 0x31
Dec 11 01:52:53 loremds10 mmfs: Tag=5016618 node 655 old mode ro new mode (A: MA D: )
Dec 11 02:15:57 loremds10 mmfs: Tag=5045549 node 994 old mode ro new mode (A: A D: A)
Dec 11 08:14:22 loremds10 mmfs: Tag=5067054 node 237 old mode ro new mode 0x08
Dec 11 08:57:39 loremds12 mmfs: Tag=4908715 node 584 old mode ro new mode (A: D: A)
Dec 11 00:47:39 loremds09 mmfs: Tag=4998635 node 461 old mode ro new mode (A:R D: )
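For anyone wanting to tally these up across a larger syslog haul, the node index can be pulled out of each "old mode" line with a one-liner; a minimal sketch over two of the sample lines above (the log format is assumed to be exactly as quoted):

```shell
# Print the GPFS node index from each "old mode" assert line.
# The field after the literal word "node" is the node index.
awk '/old mode/ { for (i = 1; i <= NF; i++) if ($i == "node") print $(i+1) }' <<'EOF'
Dec 11 00:16:40 loremds11 mmfs: Tag=5012168 node 825 old mode ro new mode 0x31
Dec 11 08:14:22 loremds10 mmfs: Tag=5067054 node 237 old mode ro new mode 0x08
EOF
```

Piping the result through `sort -n | uniq -c` gives a quick per-node assert count.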

It's interesting to note that all of these node indexes belong to nodes still running 3.5. I'm going to open a PMR, but I thought I'd share the gory details here and see if anyone has insight. I'm starting to wonder whether 4.1 clients are more tolerant of 3.5 servers than 4.1 servers are of 3.5 clients.
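To check that "all asserting nodes are still on 3.5" observation systematically rather than by eyeball, the node indexes from the assert lines can be cross-referenced against a per-node daemon-version listing. A minimal sketch; the version map here is a stubbed-out assumption (in practice it would be collected from the cluster, e.g. by running mmdiag --version on each node):

```python
import re

# Node/mode lines quoted from the asserts above.
assert_lines = [
    "Tag=5012168 node 825 old mode ro new mode 0x31",
    "Tag=5067054 node 237 old mode ro new mode 0x08",
    "Tag=4908715 node 584 old mode ro new mode (A: D: A)",
]

# Hypothetical node-index -> daemon-version map; values are placeholders,
# not real output from this cluster.
daemon_version = {825: "3.5.0.31", 237: "3.5.0.31", 584: "3.5.0.19"}

asserting = {int(m.group(1)) for line in assert_lines
             if (m := re.search(r"\bnode (\d+)\b", line))}
still_on_35 = sorted(n for n in asserting
                     if daemon_version.get(n, "").startswith("3.5"))
print(still_on_35)  # node indexes that asserted and still run a 3.5 daemon
```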

-Aaron

On 12/5/16 4:31 PM, Aaron Knister wrote:
Hi Everyone,

In the GPFS documentation
(http://www.ibm.com/support/knowledgecenter/SSFKCN_4.1.0/com.ibm.cluster.gpfs.v4r1.gpfs300.doc/bl1ins_migratl.htm)
it has this to say about the duration of an upgrade from 3.5 to 4.1:

Rolling upgrades allow you to install new GPFS code one node at a time
without shutting down GPFS
on other nodes. However, you must upgrade all nodes within a short
time. The time dependency exists
because some GPFS 4.1 features become available on each node as soon as
the node is upgraded, while
other features will not become available until you upgrade all
participating nodes.

Does anyone have a feel for what "a short time" means? I'm looking to
upgrade from 3.5.0.31 to 4.1.1.10 in a rolling fashion but given the
size of our system it might take several weeks to complete. Seeing this
language concerns me that after some period of time something bad is
going to happen, but I don't know what that period of time is.

Also, if anyone has done a rolling 3.5 to 4.1 upgrade and has any
anecdotes they'd like to share, I would like to hear them.

Thanks!

-Aaron


--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
