On Thu, 31 Jul 2025 07:47:00 -0600 Orion Poplawski <or...@nwra.com> wrote:
> On 7/28/25 03:12, Dan Horák wrote:
> > On Mon, 28 Jul 2025 10:40:03 +0200
> > Florian Weimer <fwei...@redhat.com> wrote:
> >
> >> * Dan Horák:
> >>
> >>> On Sun, 27 Jul 2025 21:34:12 +0200
> >>> Sandro Mani <manisan...@gmail.com> wrote:
> >>>
> >>>> Hi
> >>>>
> >>>> scotch is currently FTBFS on ppc64le (affects the current 7.0.7, the
> >>>> previous 7.0.6, as well as the new 7.0.8 release), failing with [1]
> >>>>
> >>>> gmake[2]: *** [src/libscotch/CMakeFiles/ptscotchf_h.dir/build.make:77:
> >>>> src/include/ptscotchf.h] Illegal instruction (core dumped)
> >>>>
> >>>> 7.0.6 previously successfully built with gcc-0:15.0.1-0.3.fc42.1.ppc64le
> >>>> and now fails with gcc-0:15.1.1-5.fc43.1.ppc64le, so this looks like a
> >>>> gcc regression.
> >>>>
> >>>> Being this on ppc64le and having no access to such a machine, how can I
> >>>> debug this?
> >>>
> >>> you have access to a system from
> >>> https://fedoraproject.org/wiki/Test_Machine_Resources_For_Package_Maintainers
> >>> and it can be reproduced there. But it needs the Power10 system (same
> >>> as current koji builders), not the Power9 (pre-DC-migration koji
> >>> builders, it builds there OK).
> >>>
> >>> ppc64le-redhat-linux-gnu-openmpi/src/libscotch/ptdummysizes is the
> >>> crashing binary ...
> >>>
> >>> and running it under gdb gives
> >>>
> >>> ...
> >>> Program received signal SIGILL, Illegal instruction.
> >>> 0x00007ffff774e404 in sbrk () from /lib64/glibc-hwcaps/power10/libc.so.6
> >>> (gdb) where
> >>> #0  0x00007ffff774e404 in sbrk () from
> >>> /lib64/glibc-hwcaps/power10/libc.so.6
> >>> #1  0x00007ffff787b38c in ucm_fire_mmap_events_internal () from
> >>> /lib64/libucm.so.0
> >>> #2  0x00007ffff787bd88 in ucm_mmap_test_events_nolock () from
> >>> /lib64/libucm.so.0
> >>> #3  0x00007ffff78818b8 in ucm_mmap_install () from /lib64/libucm.so.0
> >>> #4  0x00007ffff7881b30 in ucm_mmap_init () from /lib64/libucm.so.0
> >>> #5  0x00007ffff7881c2c in ucm_library_init () from /lib64/libucm.so.0
> >>> #6  0x00007ffff7881cbc in ucm_set_global_opts () from /lib64/libucm.so.0
> >>> #7  0x00007ffff725745c in ucs_init_ucm_opts () from /lib64/libucs.so.0
> >>> #8  0x00007ffff7243fb0 in ucs_init () from /lib64/libucs.so.0
> >>> #9  0x00007ffff7f989bc in call_init (l=<optimized out>, argc=1,
> >>> argv=0x7fffffffece8, env=0x7fffffffecf8) at dl-init.c:74
> >>> #10 _dl_init (main_map=0x7ffff7ff12f0, argc=1, argv=0x7fffffffece8,
> >>> env=0x7fffffffecf8) at dl-init.c:121
> >>> #11 0x00007ffff7fc3eb8 in _dl_start_user () from /lib64/ld64.so.2
> >>
> >> The location of the crash:
> >>
> >> Dump of assembler code for function __GI___sbrk:
> >>    0x00007ffff774e400 <+0>:   d1 ff 21 f8   stdu    r1,-48(r1)
> >> => 0x00007ffff774e404 <+4>:   0e 00 10 06   .long 0x610000e
> >>    0x00007ffff774e408 <+8>:   00 00 60 3d   lis     r11,0
> >>    0x00007ffff774e40c <+12>:  ff 7f 6b 61   ori     r11,r11,32767
> >>    0x00007ffff774e410 <+16>:  c7 07 6b 79   sldi.   r11,r11,32
> >>    0x00007ffff774e414 <+20>:  87 f7 6b 65   oris    r11,r11,63367
> >>    0x00007ffff774e418 <+24>:  b8 9e 6b 61   ori     r11,r11,40632
> >>    0x00007ffff774e41c <+28>:  a6 03 69 7d   mtctr   r11
> >>    0x00007ffff774e420 <+32>:  20 04 80 4e   bctr
> >>    0x00007ffff774e424 <+36>:  40 00 01 f8   std     r0,64(r1)
> >>    0x00007ffff774e428 <+40>:  99 61 ff 4b   bl      0x7ffff77445c0
> >> <__brk>
> >>
> >> This was patched by the ucx library.
> >>
> >> The original looks like this:
> >>
> >> 000000000014e400 <__sbrk>:
> >>   14e400:  d1 ff 21 f8   stdu    r1,-48(r1)
> >>   14e404:  0e 00 10 06   plbz    r9,961781   # 2390f9
> >> <__libc_initial>
> >>   14e408:  f5 ac 20 89
> >>   14e40c:  78 1b 62 7c   mr      r2,r3
> >>   14e410:  00 00 09 2c   cmpwi   r9,0
> >>   14e414:  4c 00 82 40   bne     14e460 <__sbrk+0x60>
> >>   14e418:  00 00 23 2c   cmpdi   r3,0
> >>   14e41c:  b0 00 82 40   bne     14e4cc <__sbrk+0xcc>
> >>   14e420:  a6 02 08 7c   mflr    r0
> >>   14e424:  40 00 01 f8   std     r0,64(r1)
> >>   14e428:  99 61 ff 4b   bl      1445c0 <brk>
> >>
> >> So there was a 64-bit instruction bundle at the patched offset, and that
> >> may have been the reason why ucx failed to patch properly.
> >
> > I agree
> >
> >> I would very much prefer if there weren't any libraries like ucx in
> >> Fedora that patch glibc merely because you link against them. It's fine
> >> to do this for debugging tools, but as part of regular execution, it
> >> risks too much breakage.
> >
> > thanks, Florian, for the insight
> >
> > IMO this issue is also causing the openmpi build failure in the
> > mass-rebuild
> > - https://koji.fedoraproject.org/koji/taskinfo?taskID=135241418
> >
> >
> > Dan
>
> I'm planning on dropping ucx support in openmpi on ppc64le:
> https://src.fedoraproject.org/rpms/openmpi/pull-request/24

ack, makes sense

> This should hopefully fix a lot of FTBFS issues with MPI using packages.
>
> Thank you very much for the detailed analysis - I would not have known
> where to start.

right, it is a weird one ...
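
For anyone who wants to see why the overwrite is fatal here, below is a
minimal Python sketch (my own illustration, not ucx code). It assumes the
Power ISA 3.1 rule that a prefixed (64-bit) instruction starts with a word
whose primary opcode (top 6 bits) is 1, and it decodes linearly from the
function start. The overwritten range (+8, 28 bytes) is my reading from
comparing the two dumps quoted above; the names is_prefix_word and
split_prefixed are made up for the example.

```python
import struct

# First 44 bytes of the original __sbrk, copied from the objdump output above.
SBRK = bytes.fromhex(
    "d1ff21f8"   # stdu r1,-48(r1)
    "0e001006"   # plbz prefix word
    "f5ac2089"   # plbz suffix word
    "781b627c" "0000092c" "4c008240" "0000232c"
    "b0008240" "a602087c" "400001f8" "9961ff4b"
)

def is_prefix_word(word: int) -> bool:
    """True if the 32-bit word has primary opcode 1 (Power ISA 3.1 prefix)."""
    return (word >> 26) == 1

def split_prefixed(code: bytes, patch_start: int, patch_len: int) -> list:
    """Offsets of prefixed instructions only partly covered by the patch range."""
    patch_end = patch_start + patch_len
    split = []
    off = 0
    while off + 4 <= len(code):
        (word,) = struct.unpack_from("<I", code, off)  # ppc64le: little-endian
        size = 8 if is_prefix_word(word) else 4
        overlaps = off < patch_end and off + size > patch_start
        covered = patch_start <= off and off + size <= patch_end
        if size == 8 and overlaps and not covered:
            split.append(off)
        off += size
    return split

# Assumed patch range: +8 .. +36, as seen in the gdb dump.
print(split_prefixed(SBRK, 8, 28))  # [4]
```

The result [4] says the plbz at +4 keeps its prefix word but loses its
suffix, which matches the ".long 0x610000e" gdb shows at the crash address.
A patcher would have to start at an instruction boundary (+4 or +12 here)
to avoid that.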

		Dan
--
devel mailing list -- devel@lists.fedoraproject.org