Luke, AFM is not tested for cascading configurations. The following statement is being added to the documentation for 4.2.1:
"Cascading of AFM caches is not tested." Thanks and Regards Radhika From: [email protected] To: [email protected] Date: 07/27/2016 04:30 PM Subject: gpfsug-discuss Digest, Vol 54, Issue 59 Sent by: [email protected] Send gpfsug-discuss mailing list submissions to [email protected] To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to [email protected] You can reach the person managing the list at [email protected] When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. AFM Crashing the MDS (Luke Raimbach) ---------------------------------------------------------------------- Message: 1 Date: Tue, 26 Jul 2016 14:17:35 +0000 From: Luke Raimbach <[email protected]> To: gpfsug main discussion list <[email protected]> Subject: [gpfsug-discuss] AFM Crashing the MDS Message-ID: <amspr03mb27605d717c5500d86f6adefb0...@amspr03mb276.eurprd03.prod.outlook.com> Content-Type: text/plain; charset="utf-8" Hi All, Anyone seen GPFS barf like this before? I'll explain the setup: RO AFM cache on remote site (cache A) for reading remote datasets quickly, LU AFM cache at destination site (cache B) for caching data from cache A (has a local compute cluster mounting this over multi-cluster), IW AFM cache at destination site (cache C) for presenting cache B over NAS protocols, Reading files in cache C should pull data from the remote source through cache A->B->C Modifying files in cache C should pull data into cache B and then break the cache relationship for that file, converting it to a local copy. Those modifications should include metadata updates (e.g. chown). To speed things up we prefetch files into cache B for datasets which are undergoing migration and have entered a read-only state at the source. 
When issuing chown on a directory in cache C containing ~4.5 million files, the MDS for AFM cache C crashes badly:

Tue Jul 26 13:28:52.487 2016: [X] logAssertFailed: addr.isReserved() || addr.getClusterIdx() == clusterIdx
Tue Jul 26 13:28:52.488 2016: [X] return code 0, reason code 1, log record tag 0
Tue Jul 26 13:28:53.392 2016: [X] *** Assert exp(addr.isReserved() || addr.getClusterIdx() == clusterIdx) in line 1936 of file /project/sprelbmd0/build/rbmd0s003a/src/avs/fs/mmfs/ts/cfgmgr/cfgmgr.h
Tue Jul 26 13:28:53.393 2016: [E] *** Traceback:
Tue Jul 26 13:28:53.394 2016: [E] 2:0x7F6DC95444A6 logAssertFailed + 0x2D6 at ??:0
Tue Jul 26 13:28:53.395 2016: [E] 3:0x7F6DC95C7EF4 ClusterConfiguration::getGatewayNewHash(DiskUID, unsigned int, NodeAddr*) + 0x4B4 at ??:0
Tue Jul 26 13:28:53.396 2016: [E] 4:0x7F6DC95C8031 ClusterConfiguration::getGatewayNode(DiskUID, unsigned int, NodeAddr, NodeAddr*, unsigned int) + 0x91 at ??:0
Tue Jul 26 13:28:53.397 2016: [E] 5:0x7F6DC9DC7126 SFSPcache(StripeGroup*, FileUID, int, int, void*, int, voidXPtr*, int*) + 0x346 at ??:0
Tue Jul 26 13:28:53.398 2016: [E] 6:0x7F6DC9332494 HandleMBPcache(MBPcacheParms*) + 0xB4 at ??:0
Tue Jul 26 13:28:53.399 2016: [E] 7:0x7F6DC90A4A53 Mailbox::msgHandlerBody(void*) + 0x3C3 at ??:0
Tue Jul 26 13:28:53.400 2016: [E] 8:0x7F6DC908BC06 Thread::callBody(Thread*) + 0x46 at ??:0
Tue Jul 26 13:28:53.401 2016: [E] 9:0x7F6DC907A0D2 Thread::callBodyWrapper(Thread*) + 0xA2 at ??:0
Tue Jul 26 13:28:53.402 2016: [E] 10:0x7F6DC87A3AA1 start_thread + 0xD1 at ??:0
Tue Jul 26 13:28:53.403 2016: [E] 11:0x7F6DC794A93D clone + 0x6D at ??:0
mmfsd: /project/sprelbmd0/build/rbmd0s003a/src/avs/fs/mmfs/ts/cfgmgr/cfgmgr.h:1936: void logAssertFailed(UInt32, const char*, UInt32, Int32, Int32, UInt32, const char*, const char*): Assertion `addr.isReserved() || addr.getClusterIdx() == clusterIdx' failed.
Tue Jul 26 13:28:53.404 2016: [N] Signal 6 at location 0x7F6DC7894625 in process 6262, link reg 0xFFFFFFFFFFFFFFFF.
Tue Jul 26 13:28:53.405 2016: [I] rax 0x0000000000000000 rbx 0x00007F6DC8DCB000
Tue Jul 26 13:28:53.406 2016: [I] rcx 0xFFFFFFFFFFFFFFFF rdx 0x0000000000000006
Tue Jul 26 13:28:53.407 2016: [I] rsp 0x00007F6DAAEA01F8 rbp 0x00007F6DCA05C8B0
Tue Jul 26 13:28:53.408 2016: [I] rsi 0x00000000000018F8 rdi 0x0000000000001876
Tue Jul 26 13:28:53.409 2016: [I] r8 0xFEFEFEFEFEFEFEFF r9 0xFEFEFEFEFF092D63
Tue Jul 26 13:28:53.410 2016: [I] r10 0x0000000000000008 r11 0x0000000000000202
Tue Jul 26 13:28:53.411 2016: [I] r12 0x00007F6DC9FC5540 r13 0x00007F6DCA05C1C0
Tue Jul 26 13:28:53.412 2016: [I] r14 0x0000000000000000 r15 0x0000000000000000
Tue Jul 26 13:28:53.413 2016: [I] rip 0x00007F6DC7894625 eflags 0x0000000000000202
Tue Jul 26 13:28:53.414 2016: [I] csgsfs 0x0000000000000033 err 0x0000000000000000
Tue Jul 26 13:28:53.415 2016: [I] trapno 0x0000000000000000 oldmsk 0x0000000010017807
Tue Jul 26 13:28:53.416 2016: [I] cr2 0x0000000000000000
Tue Jul 26 13:28:54.225 2016: [D] Traceback:
Tue Jul 26 13:28:54.226 2016: [D] 0:00007F6DC7894625 raise + 35 at ??:0
Tue Jul 26 13:28:54.227 2016: [D] 1:00007F6DC7895E05 abort + 175 at ??:0
Tue Jul 26 13:28:54.228 2016: [D] 2:00007F6DC788D74E __assert_fail_base + 11E at ??:0
Tue Jul 26 13:28:54.229 2016: [D] 3:00007F6DC788D810 __assert_fail + 50 at ??:0
Tue Jul 26 13:28:54.230 2016: [D] 4:00007F6DC95444CA logAssertFailed + 2FA at ??:0
Tue Jul 26 13:28:54.231 2016: [D] 5:00007F6DC95C7EF4 ClusterConfiguration::getGatewayNewHash(DiskUID, unsigned int, NodeAddr*) + 4B4 at ??:0
Tue Jul 26 13:28:54.232 2016: [D] 6:00007F6DC95C8031 ClusterConfiguration::getGatewayNode(DiskUID, unsigned int, NodeAddr, NodeAddr*, unsigned int) + 91 at ??:0
Tue Jul 26 13:28:54.233 2016: [D] 7:00007F6DC9DC7126 SFSPcache(StripeGroup*, FileUID, int, int, void*, int, voidXPtr*, int*) + 346 at ??:0
Tue Jul 26 13:28:54.234 2016: [D] 8:00007F6DC9332494 HandleMBPcache(MBPcacheParms*) + B4 at ??:0
Tue Jul 26 13:28:54.235 2016: [D] 9:00007F6DC90A4A53 Mailbox::msgHandlerBody(void*) + 3C3 at ??:0
Tue Jul 26 13:28:54.236 2016: [D] 10:00007F6DC908BC06 Thread::callBody(Thread*) + 46 at ??:0
Tue Jul 26 13:28:54.237 2016: [D] 11:00007F6DC907A0D2 Thread::callBodyWrapper(Thread*) + A2 at ??:0
Tue Jul 26 13:28:54.238 2016: [D] 12:00007F6DC87A3AA1 start_thread + D1 at ??:0
Tue Jul 26 13:28:54.239 2016: [D] 13:00007F6DC794A93D clone + 6D at ??:0
Tue Jul 26 13:28:54.240 2016: [N] Restarting mmsdrserv
Tue Jul 26 13:28:55.535 2016: [N] Signal 6 at location 0x7F6DC790EA7D in process 6262, link reg 0xFFFFFFFFFFFFFFFF.
Tue Jul 26 13:28:55.536 2016: [N] mmfsd is shutting down.
Tue Jul 26 13:28:55.537 2016: [N] Reason for shutdown: Signal handler entered
Tue Jul 26 13:28:55 BST 2016: mmcommon mmfsdown invoked. Subsystem: mmfs Status: active
Tue Jul 26 13:28:55 BST 2016: /var/mmfs/etc/mmfsdown invoked
umount2: Device or resource busy
umount: /camp: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
umount2: Device or resource busy
umount: /ingest: device is busy.
        (In some cases useful info about processes that use
         the device is found by lsof(8) or fuser(1))
Shutting down NFS daemon: [ OK ]
Shutting down NFS mountd: [ OK ]
Shutting down NFS quotas: [ OK ]
Shutting down NFS services: [ OK ]
Shutting down RPC idmapd: [ OK ]
Stopping NFS statd: [ OK ]

Ugly, right?

Cheers,
Luke.

Luke Raimbach
Senior HPC Data and Storage Systems Engineer,
The Francis Crick Institute,
Gibbs Building,
215 Euston Road,
London NW1 2BE.
E: [email protected] W: www.crick.ac.uk The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 215 Euston Road, London NW1 2BE. ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 54, Issue 59 **********************************************
