Having a sanctioned way to compile targeting a version of the kernel that is installed — but not running — would be helpful in many circumstances.
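Until such an option exists, the closest workaround (per the thread below) is hand-maintaining copies of env.mcr, which is exactly what went wrong here. A small sketch of how one might at least capture what "make Autoconfig" generates for the running kernel before hand-editing; the paths are the stock /usr/lpp/mmfs/src layout and the .bak name is only illustrative:

    cd /usr/lpp/mmfs/src
    cp config/env.mcr config/env.mcr.bak     # keep the hand-maintained copy
    make Autoconfig                          # regenerates env.mcr for the running kernel
    diff config/env.mcr.bak config/env.mcr   # see exactly what Autoconfig would use
    cp config/env.mcr.bak config/env.mcr     # put the hand-edited copy back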
— Stephen

> On Jan 17, 2020, at 11:58 AM, Ryan Novosielski <[email protected]> wrote:
>
> Yeah, support got back to me with a similar response earlier today that I'd
> not seen yet that made it a lot clearer what I "did wrong." This would appear
> to be the cause in my case:
>
> [root@master config]# diff env.mcr env.mcr-1062.9.1
> 4,5c4,5
> < #define LINUX_KERNEL_VERSION 31000999
> < #define LINUX_KERNEL_VERSION_VERBOSE 310001062009001
> ---
> > #define LINUX_KERNEL_VERSION 31001062
> > #define LINUX_KERNEL_VERSION_VERBOSE 31001062009001
>
> …the former having been generated by "make Autoconfig" and the latter
> generated by my brain. I'm surprised at the first line — I'd have caught on
> myself that something different might have been needed if 3.10.0-1062 hadn't
> already fit in the number of digits.
>
> Anyway, I explained to support that the reason I do this is that I maintain a
> couple of copies of env.mcr, because occasionally there will be reasons to
> need gpfs.gplbin for a few different kernel versions (other software that
> doesn't want to be upgraded, etc.). I see I originally got this practice from
> the README (or possibly our original installer consultants).
>
> Basically what's missing here, so far as I can see, is a way to use
> mmbuildgpl/make Autoconfig but specify a target kernel version (and, I guess,
> an update to the docs, or at least /usr/lpp/mmfs/src/README, so that they
> don't suggest manual editing). Is there a way to at least find out what "make
> Autoconfig" would use for a target LINUX_KERNEL_VERSION_VERBOSE? From what I
> can see of the makefile and config/configure, there's no option for
> specifying anything.
>
> --
> ____
> || \\UTGERS,     |---------------------------*O*---------------------------
> ||_// the State  |         Ryan Novosielski - [email protected]
> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>      `'
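Reading the two values against each other suggests (this is inferred only from the diff above, not from the Scale sources) that the 5.0.4 Autoconfig now packs the verbose value as <major><minor:2><patch:2><fixlevel:4><x:3><y:3>, with a fourth digit reserved for the fix level, while the plain LINUX_KERNEL_VERSION appears to cap the fix level at 999 (hence 31000999). A quick sanity check of that reading:

    # Inferred layout only; variable names are illustrative.
    maj=3 min=10 patch=0 fix=1062 x=9 y=1        # 3.10.0-1062.9.1
    printf 'LINUX_KERNEL_VERSION_VERBOSE %d%02d%02d%04d%03d%03d\n' \
           "$maj" "$min" "$patch" "$fix" "$x" "$y"
    # prints 310001062009001, matching the "make Autoconfig" value above;
    # the hand-edited 31001062009001 is one digit short, which is what the
    # GPL layer tripped over.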
>
>> On Jan 17, 2020, at 11:36 AM, Felipe Knop <[email protected]> wrote:
>>
>> Hi Ryan,
>>
>> My interpretation of the analysis so far is that the content of
>> LINUX_KERNEL_VERSION_VERBOSE in 'env.mcr' became incorrect. That is, it
>> used to work well in a prior release of Scale, but not with 5.0.4.1. This
>> is because of a code change that added another digit to the version in
>> LINUX_KERNEL_VERSION_VERBOSE to account for the 4-digit "fix level"
>> (3.10.0-1000+). Then, when the GPL layer was built, its sources saw the
>> content of LINUX_KERNEL_VERSION_VERBOSE without the extra digit and
>> compiled the 'wrong' pieces in -- in particular an incorrect value of
>> SECURITY_INODE_INIT_SECURITY(). And that led to the crash.
>>
>> The problem did not happen when mmbuildgpl was used, since the correct value
>> of LINUX_KERNEL_VERSION_VERBOSE was then set up.
>>
>> Felipe
>>
>> ----
>> Felipe Knop                [email protected]
>> GPFS Development and Security
>> IBM Systems
>> IBM Building 008
>> 2455 South Rd, Poughkeepsie, NY 12601
>> (845) 433-9314  T/L 293-9314
>>
>>
>> ----- Original message -----
>> From: Ryan Novosielski <[email protected]>
>> Sent by: [email protected]
>> To: gpfsug main discussion list <[email protected]>
>> Cc:
>> Subject: [EXTERNAL] Re: [gpfsug-discuss] Kernel BUG/panic in mm/slub.c:3772
>>   on Spectrum Scale Data Access Edition installed via gpfs.gplbin RPM on KVM
>>   guests
>> Date: Fri, Jan 17, 2020 10:56 AM
>>
>> That /is/ interesting.
>>
>> I'm a little confused about how that could be playing out in a case where
>> I'm building on -1062.9.1, building for -1062.9.1, and running on -1062.9.1.
>> Is there something inherent in the RPM building process that hasn't caught
>> up, or am I misunderstanding that change's impact on it?
>>
>> --
>> ____
>> || \\UTGERS,     |---------------------------*O*---------------------------
>> ||_// the State  |         Ryan Novosielski - [email protected]
>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>>      `'
>>
>>> On Jan 17, 2020, at 10:35, Felipe Knop <[email protected]> wrote:
>>>
>>> Hi Ryan,
>>>
>>> Some interesting IBM-internal communication overnight. The problem seems
>>> related to a change made to LINUX_KERNEL_VERSION_VERBOSE to handle the
>>> additional digit in the kernel numbering (3.10.0-1000+). The GPL layer
>>> expected LINUX_KERNEL_VERSION_VERBOSE to have that extra digit, and its
>>> absence resulted in an incorrect function being compiled in, which led to
>>> the crash.
>>>
>>> This, at least, seems to make sense in terms of matching the symptoms
>>> of the problem.
>>>
>>> We are still in internal debates on whether/how to update our guidelines
>>> for gplbin generation ...
>>>
>>> Regards,
>>>
>>> Felipe
>>>
>>> ----
>>> Felipe Knop                [email protected]
>>> GPFS Development and Security
>>> IBM Systems
>>> IBM Building 008
>>> 2455 South Rd, Poughkeepsie, NY 12601
>>> (845) 433-9314  T/L 293-9314
>>>
>>>
>>> ----- Original message -----
>>> From: Ryan Novosielski <[email protected]>
>>> Sent by: [email protected]
>>> To: "[email protected]" <[email protected]>
>>> Cc:
>>> Subject: [EXTERNAL] Re: [gpfsug-discuss] Kernel BUG/panic in mm/slub.c:3772
>>>   on Spectrum Scale Data Access Edition installed via gpfs.gplbin RPM on KVM
>>>   guests
>>> Date: Thu, Jan 16, 2020 4:33 PM
>>>
>>> Hi Felipe,
>>>
>>> I either misunderstood support or convinced them to take further
>>> action. It at first looked like they were suggesting "mmbuildgpl fixed
>>> it: case closed" (I know they wanted to close the SalesForce case
>>> anyway, which would prevent communication on the issue). At this
>>> point, they've asked for a bunch more information.
>>>
>>> Support is asking similar questions re: the speculations, and I'll
>>> provide them with the relevant output ASAP, but I did confirm all of
>>> that, including that there were no stray mmfs26/tracedev kernel
>>> modules anywhere else in the relevant /lib/modules PATHs. In the
>>> original case, I built on a machine running 3.10.0-957.27.2, but
>>> pointed at the 3.10.0-1062.9.1 source code and defined the relevant
>>> portions of /usr/lpp/mmfs/src/config/env.mcr. That's always worked
>>> before, and rebuilding once the build system was running
>>> 3.10.0-1062.9.1 as well did not change anything either. In all cases,
>>> the GPFS version was Spectrum Scale Data Access Edition 5.0.4-1. If
>>> you build against either the wrong kernel version or the wrong GPFS
>>> version, both will appear right in the filename of the gpfs.gplbin RPM
>>> you build. Mine is called:
>>>
>>> gpfs.gplbin-3.10.0-1062.9.1.el7.x86_64-5.0.4-1.x86_64.rpm
>>>
>>> Anyway, thanks for your response; I know you might not be
>>> following/working on this directly, but I figured the extra info might
>>> be of interest.
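Since both the kernel and Scale versions are encoded in the package name, a quick way to double-check what a built gplbin RPM actually contains is to list its payload. This is just a sketch, and the grep pattern is only illustrative:

    rpm -qp --list gpfs.gplbin-3.10.0-1062.9.1.el7.x86_64-5.0.4-1.x86_64.rpm \
        | grep -E 'mmfs26|tracedev'
    # the module paths in the payload should sit under the same
    # /lib/modules/<kernel>/ tree that the filename advertises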
>>>
>>> On 1/16/20 8:41 AM, Felipe Knop wrote:
>>>> Hi Ryan,
>>>>
>>>> I'm aware of this ticket, and I understand that there has been
>>>> active communication with the service team on this problem.
>>>>
>>>> The crash itself, as you indicate, looks like a problem that has
>>>> been fixed:
>>>>
>>>> https://www.ibm.com/support/pages/ibm-spectrum-scale-gpfs-releases-42313-or-later-and-5022-or-later-have-issues-where-kernel-crashes-rhel76-0
>>>>
>>>> The fact that the problem goes away when *mmbuildgpl* is issued
>>>> appears to point to some incompatibility with kernel levels and/or
>>>> Scale version levels. Just speculating, some possible areas may be:
>>>>
>>>> * The RPM might have been built on a version of Scale without the fix
>>>> * The RPM might have been built on a different (minor) version of the
>>>>   kernel
>>>> * Somehow the VM picked up a "leftover" GPFS kernel module, as opposed
>>>>   to the one included in gpfs.gplbin -- given that mmfsd never
>>>>   complained about a missing GPL kernel module
>>>>
>>>> Felipe
>>>>
>>>> ----
>>>> Felipe Knop                [email protected]
>>>> GPFS Development and Security
>>>> IBM Systems
>>>> IBM Building 008
>>>> 2455 South Rd, Poughkeepsie, NY 12601
>>>> (845) 433-9314  T/L 293-9314
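On the "leftover module" theory, the kind of check Ryan describes above (no stray mmfs26/tracedev modules under other kernels) can be sketched roughly as follows; the modinfo fields shown are just the usual ones and the patterns are illustrative:

    # look for GPFS portability-layer modules under every installed kernel,
    # not just the running one
    find /lib/modules -name 'mmfs26*' -o -name 'tracedev*'
    # and confirm which file the running kernel would actually load
    modinfo mmfs26 2>/dev/null | grep -E '^(filename|vermagic)'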
>>>>
>>>> ----- Original message -----
>>>> From: Ryan Novosielski <[email protected]>
>>>> Sent by: [email protected]
>>>> To: gpfsug main discussion list <[email protected]>
>>>> Cc:
>>>> Subject: [EXTERNAL] [gpfsug-discuss] Kernel BUG/panic in mm/slub.c:3772
>>>>   on Spectrum Scale Data Access Edition installed via gpfs.gplbin RPM on
>>>>   KVM guests
>>>> Date: Wed, Jan 15, 2020 4:11 PM
>>>>
>>>> Hi there,
>>>>
>>>> I know some of the Spectrum Scale developers look at this list.
>>>> I'm having a little trouble with support on this problem.
>>>>
>>>> We are seeing crashes with GPFS 5.0.4-1 Data Access Edition on KVM
>>>> guests with a portability layer that has been installed via
>>>> gpfs.gplbin RPMs that we built at our site and have used to
>>>> install GPFS all over our environment. We've not seen this problem
>>>> so far on any physical hosts, but have now experienced it on guests
>>>> running on a number of our KVM hypervisors, across vendors and
>>>> firmware versions, etc. At one time I thought it was all happening
>>>> on systems using Mellanox virtual functions for InfiniBand, but
>>>> we've now seen it on VMs without VFs. There may be an SELinux
>>>> interaction, but some of our hosts have it disabled outright, some
>>>> are Permissive, and some were working successfully with 5.0.2.x
>>>> GPFS.
>>>>
>>>> What I've been instructed to try to solve this problem has been to
>>>> run "mmbuildgpl", and it has solved the problem. I don't consider
>>>> running "mmbuildgpl" a real solution, however. If RPMs are a
>>>> supported means of installation, it should work. Support told me
>>>> that they'd seen this solve the problem at another site as well.
>>>>
>>>> Does anyone have any more information about this problem: whether
>>>> there's a fix in the pipeline, or something that can be done to
>>>> cause this problem that we could remedy? Is there an easy place to
>>>> see a list of eFixes to check whether this has come up? I know it's
>>>> very similar to a problem that happened, I believe, after 5.0.2.2 and
>>>> Linux 3.10.0-957.19.1, but that was fixed already in 5.0.3.x.
>>>>
>>>> Below is a sample of the crash output:
>>>>
>>>> [ 156.733477] kernel BUG at mm/slub.c:3772!
>>>> [ 156.734212] invalid opcode: 0000 [#1] SMP
>>>> [ 156.735017] Modules linked in: ebtable_nat ebtable_filter ebtable_broute bridge stp llc ebtables mmfs26(OE) mmfslinux(OE) tracedev(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) mlx4_en(OE) mlx4_ib(OE) mlx4_core(OE) ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_raw nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 xt_comment xt_multiport xt_conntrack nf_conntrack iptable_filter iptable_security nfit libnvdimm ppdev iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper sg joydev pcspkr cryptd parport_pc parport i2c_piix4 virtio_balloon knem(OE) binfmt_misc ip_tables xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) sr_mod cdrom ata_generic pata_acpi virtio_console virtio_net virtio_blk crct10dif_pclmul crct10dif_common mlx5_core(OE) mlxfw(OE) crc32c_intel ptp pps_core devlink ata_piix serio_raw mlx_compat(OE) libata virtio_pci floppy virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
>>>> [ 156.754814] CPU: 3 PID: 11826 Comm: request_handle* Tainted: G OE ------------ 3.10.0-1062.9.1.el7.x86_64 #1
>>>> [ 156.756782] Hardware name: Red Hat KVM, BIOS 1.11.0-2.el7 04/01/2014
>>>> [ 156.757978] task: ffff8aeca5bf8000 ti: ffff8ae9f7a24000 task.ti: ffff8ae9f7a24000
>>>> [ 156.759326] RIP: 0010:[<ffffffffbbe23dec>] [<ffffffffbbe23dec>] kfree+0x13c/0x140
>>>> [ 156.760749] RSP: 0018:ffff8ae9f7a27278 EFLAGS: 00010246
>>>> [ 156.761717] RAX: 001fffff00000400 RBX: ffffffffbc6974bf RCX: ffffa74dc1bcfb60
>>>> [ 156.763030] RDX: 001fffff00000000 RSI: ffff8aed90fc6500 RDI: ffffffffbc6974bf
>>>> [ 156.764321] RBP: ffff8ae9f7a27290 R08: 0000000000000014 R09: 0000000000000003
>>>> [ 156.765612] R10: 0000000000000048 R11: ffffdb5a82d125c0 R12: ffffa74dc4fd36c0
>>>> [ 156.766938] R13: ffffffffc0a1c562 R14: ffff8ae9f7a272f8 R15: ffff8ae9f7a27938
>>>> [ 156.768229] FS: 00007f8ffff05700(0000) GS:ffff8aedbfd80000(0000) knlGS:0000000000000000
>>>> [ 156.769708] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [ 156.770754] CR2: 000055963330e2b0 CR3: 0000000325ad2000 CR4: 00000000003606e0
>>>> [ 156.772076] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>>> [ 156.773367] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>>> [ 156.774663] Call Trace:
>>>> [ 156.775154] [<ffffffffc0a1c562>] cxiInitInodeSecurityCleanup+0x12/0x20 [mmfslinux]
>>>> [ 156.776568] [<ffffffffc0b50562>] _Z17newInodeInitLinuxP15KernelOperationP13gpfsVfsData_tPP8OpenFilePPvPP10gpfsNode_tP7FileUIDS6_N5LkObj12LockModeEnumE+0x152/0x290 [mmfs26]
>>>> [ 156.779378] [<ffffffffc0b5cdfa>] _Z9gpfsMkdirP13gpfsVfsData_tP15KernelOperationP9cxiNode_tPPvPS4_PyS5_PcjjjP10ext_cred_t+0x46a/0x7e0 [mmfs26]
>>>> [ 156.781689] [<ffffffffc0bdb928>] ? _ZN14BaseMutexClass15releaseLockHeldEP16KernelSynchState+0x18/0x130 [mmfs26]
>>>> [ 156.783565] [<ffffffffc0c3db2d>] _ZL21pcacheHandleCacheMissP13gpfsVfsData_tP15KernelOperationP10gpfsNode_tPvPcPyP12pCacheResp_tPS5_PS4_PjSA_j+0x4bd/0x760 [mmfs26]
>>>> [ 156.786228] [<ffffffffc0c40675>] _Z12pcacheLookupP13gpfsVfsData_tP15KernelOperationP10gpfsNode_tPvPcP7FilesetjjjPS5_PS4_PyPjS9_+0x1ff5/0x21a0 [mmfs26]
>>>> [ 156.788681] [<ffffffffc0c023ef>] ? _Z15findFilesetByIdP15KernelOperationjjPP7Filesetj+0x4f/0xa0 [mmfs26]
>>>> [ 156.790448] [<ffffffffc0b6d59c>] _Z10gpfsLookupP13gpfsVfsData_tPvP9cxiNode_tS1_S1_PcjPS1_PS3_PyP10cxiVattr_tPjP10ext_cred_tjS5_PiS4_SD_+0x65c/0xad0 [mmfs26]
>>>> [ 156.793032] [<ffffffffc0b8b022>] ? _Z33gpfsIsCifsBypassTraversalCheckingv+0xe2/0x130 [mmfs26]
>>>> [ 156.794588] [<ffffffffc0a36d96>] gpfs_i_lookup+0x2e6/0x5a0 [mmfslinux]
>>>> [ 156.795838] [<ffffffffc0b6cf40>] ? _Z8gpfsLinkP13gpfsVfsData_tP9cxiNode_tS2_PvPcjjP10ext_cred_t+0x6c0/0x6c0 [mmfs26]
>>>> [ 156.797753] [<ffffffffbbe65d52>] ? __d_alloc+0x122/0x180
>>>> [ 156.798763] [<ffffffffbbe65e10>] ? d_alloc+0x60/0x70
>>>> [ 156.799700] [<ffffffffbbe556d3>] lookup_real+0x23/0x60
>>>> [ 156.800651] [<ffffffffbbe560f2>] __lookup_hash+0x42/0x60
>>>> [ 156.801675] [<ffffffffbc377874>] lookup_slow+0x42/0xa7
>>>> [ 156.802634] [<ffffffffbbe5ac3f>] link_path_walk+0x80f/0x8b0
>>>> [ 156.803666] [<ffffffffbbe5ae4a>] path_lookupat+0x7a/0x8b0
>>>> [ 156.804690] [<ffffffffbbdcd2fe>] ? lru_cache_add+0xe/0x10
>>>> [ 156.805690] [<ffffffffbbe24ef5>] ? kmem_cache_alloc+0x35/0x1f0
>>>> [ 156.806766] [<ffffffffbbe5c45f>] ? getname_flags+0x4f/0x1a0
>>>> [ 156.807817] [<ffffffffbbe5b6ab>] filename_lookup+0x2b/0xc0
>>>> [ 156.808834] [<ffffffffbbe5d5f7>] user_path_at_empty+0x67/0xc0
>>>> [ 156.809923] [<ffffffffbbdf3ecd>] ? handle_mm_fault+0x39d/0x9b0
>>>> [ 156.811017] [<ffffffffbbe5d661>] user_path_at+0x11/0x20
>>>> [ 156.811983] [<ffffffffbbe50343>] vfs_fstatat+0x63/0xc0
>>>> [ 156.812951] [<ffffffffbbe506fe>] SYSC_newstat+0x2e/0x60
>>>> [ 156.813931] [<ffffffffbc388a26>] ? trace_do_page_fault+0x56/0x150
>>>> [ 156.815050] [<ffffffffbbe50bbe>] SyS_newstat+0xe/0x10
>>>> [ 156.816010] [<ffffffffbc38dede>] system_call_fastpath+0x25/0x2a
>>>> [ 156.817104] Code: 49 8b 03 31 f6 f6 c4 40 74 04 41 8b 73 68 4c 89 df e8 89 2f fa ff eb 84 4c 8b 58 30 48 8b 10 80 e6 80 4c 0f 44 d8 e9 28 ff ff ff <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41 55 41 54
>>>> [ 156.822192] RIP [<ffffffffbbe23dec>] kfree+0x13c/0x140
>>>> [ 156.823180] RSP <ffff8ae9f7a27278>
>>>> [ 156.823872] ---[ end trace 142960be4a4feed8 ]---
>>>> [ 156.824806] Kernel panic - not syncing: Fatal exception
>>>> [ 156.826475] Kernel Offset: 0x3ac00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
>>>>
>>>> --
>>>> ____
>>>> || \\UTGERS,     |---------------------------*O*---------------------------
>>>> ||_// the State  |         Ryan Novosielski - [email protected]
>>>> || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
>>>> ||  \\    of NJ  | Office of Advanced Research Computing - MSB C630, Newark
>>>>      `'
>>>>
>>>
>>> --
>>> ____
>>> || \\UTGERS,     |----------------------*O*------------------------
>>> ||_// the State  | Ryan Novosielski - [email protected]
>>> || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus
>>> ||  \\    of NJ  | Office of Advanced Res. Comp. - MSB C630, Newark
>>>      `'

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
