On Thu, 31 Jul 2025 07:47:00 -0600
Orion Poplawski <or...@nwra.com> wrote:

> On 7/28/25 03:12, Dan Horák wrote:
> > On Mon, 28 Jul 2025 10:40:03 +0200
> > Florian Weimer <fwei...@redhat.com> wrote:
> > 
> >> * Dan Horák:
> >>
> >>> On Sun, 27 Jul 2025 21:34:12 +0200
> >>> Sandro Mani <manisan...@gmail.com> wrote:
> >>>
> >>>> Hi
> >>>>
> >>>> scotch is currently FTBFS on ppc64le (affects the current 7.0.7, the
> >>>> previous 7.0.6, as well as the new 7.0.8 release), failing with [1]
> >>>>
> >>>> gmake[2]: *** [src/libscotch/CMakeFiles/ptscotchf_h.dir/build.make:77: src/include/ptscotchf.h] Illegal instruction (core dumped)
> >>>>
> >>>> 7.0.6 previously successfully built with gcc-0:15.0.1-0.3.fc42.1.ppc64le 
> >>>> and now fails with gcc-0:15.1.1-5.fc43.1.ppc64le, so this looks like a 
> >>>> gcc regression.
> >>>>
> >>>> Since this is on ppc64le and I have no access to such a machine, how
> >>>> can I debug this?
> >>>
> >>> you have access to a system from
> >>> https://fedoraproject.org/wiki/Test_Machine_Resources_For_Package_Maintainers
> >>> and it can be reproduced there. But it needs the Power10 system (same
> >>> as current koji builders), not the Power9 (pre-DC-migration koji
> >>> builders, it builds there OK).
> >>>
> >>> ppc64le-redhat-linux-gnu-openmpi/src/libscotch/ptdummysizes is the
> >>> crashing binary ...
> >>>
> >>> and running it under gdb gives
> >>>
> >>> ...
> >>> Program received signal SIGILL, Illegal instruction.
> >>> 0x00007ffff774e404 in sbrk () from /lib64/glibc-hwcaps/power10/libc.so.6
> >>> (gdb) where
> >>> #0  0x00007ffff774e404 in sbrk () from /lib64/glibc-hwcaps/power10/libc.so.6
> >>> #1  0x00007ffff787b38c in ucm_fire_mmap_events_internal () from /lib64/libucm.so.0
> >>> #2  0x00007ffff787bd88 in ucm_mmap_test_events_nolock () from /lib64/libucm.so.0
> >>> #3  0x00007ffff78818b8 in ucm_mmap_install () from /lib64/libucm.so.0
> >>> #4  0x00007ffff7881b30 in ucm_mmap_init () from /lib64/libucm.so.0
> >>> #5  0x00007ffff7881c2c in ucm_library_init () from /lib64/libucm.so.0
> >>> #6  0x00007ffff7881cbc in ucm_set_global_opts () from /lib64/libucm.so.0
> >>> #7  0x00007ffff725745c in ucs_init_ucm_opts () from /lib64/libucs.so.0
> >>> #8  0x00007ffff7243fb0 in ucs_init () from /lib64/libucs.so.0
> >>> #9  0x00007ffff7f989bc in call_init (l=<optimized out>, argc=1, argv=0x7fffffffece8, env=0x7fffffffecf8) at dl-init.c:74
> >>> #10 _dl_init (main_map=0x7ffff7ff12f0, argc=1, argv=0x7fffffffece8, env=0x7fffffffecf8) at dl-init.c:121
> >>> #11 0x00007ffff7fc3eb8 in _dl_start_user () from /lib64/ld64.so.2
> >>
> >> The location of the crash:
> >>
> >> Dump of assembler code for function __GI___sbrk:
> >>     0x00007ffff774e400 <+0>:     d1 ff 21 f8     stdu    r1,-48(r1)
> >> => 0x00007ffff774e404 <+4>:     0e 00 10 06     .long 0x610000e
> >>     0x00007ffff774e408 <+8>:     00 00 60 3d     lis     r11,0
> >>     0x00007ffff774e40c <+12>:    ff 7f 6b 61     ori     r11,r11,32767
> >>     0x00007ffff774e410 <+16>:    c7 07 6b 79     sldi.   r11,r11,32
> >>     0x00007ffff774e414 <+20>:    87 f7 6b 65     oris    r11,r11,63367
> >>     0x00007ffff774e418 <+24>:    b8 9e 6b 61     ori     r11,r11,40632
> >>     0x00007ffff774e41c <+28>:    a6 03 69 7d     mtctr   r11
> >>     0x00007ffff774e420 <+32>:    20 04 80 4e     bctr
> >>     0x00007ffff774e424 <+36>:    40 00 01 f8     std     r0,64(r1)
> >>     0x00007ffff774e428 <+40>:    99 61 ff 4b     bl      0x7ffff77445c0 <__brk>
> >>
> >> This was patched by the ucx library.
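
For anyone following along: the instructions from +8 to +24 in the dump
above build a 64-bit constant in r11, and mtctr/bctr then jump to it. A
small Python sketch can rebuild that constant from the immediates shown.
The function name is made up for illustration, and the arithmetic
assumes the usual semantics of lis/ori/sldi/oris.

```python
MASK64 = (1 << 64) - 1

def trampoline_target(lis_imm, ori_hi, oris_imm, ori_lo, shift=32):
    """Rebuild the 64-bit value that lis/ori/sldi/oris/ori leave in r11.

    Hypothetical helper: it only mirrors the five instructions seen in
    the gdb dump; it is not taken from ucx.
    """
    # lis loads a sign-extended 16-bit immediate shifted left by 16.
    r11 = (((lis_imm ^ 0x8000) - 0x8000) << 16) & MASK64
    r11 |= ori_hi                   # ori  r11,r11,ori_hi
    r11 = (r11 << shift) & MASK64   # sldi r11,r11,shift
    r11 |= oris_imm << 16           # oris r11,r11,oris_imm
    r11 |= ori_lo                   # ori  r11,r11,ori_lo
    return r11

# Immediates copied from the dump: lis 0, ori 32767, oris 63367, ori 40632.
target = trampoline_target(0, 32767, 63367, 40632)
print(hex(target))
```

The result falls in the same address range as the libucm.so.0 frames in
the backtrace, which would be consistent with a jump into a ucx hook.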
> >>
> >> The original looks like this:
> >>
> >> 000000000014e400 <__sbrk>:
> >>    14e400:       d1 ff 21 f8     stdu    r1,-48(r1)
> >>    14e404:       0e 00 10 06     plbz    r9,961781       # 2390f9 <__libc_initial>
> >>    14e408:       f5 ac 20 89
> >>    14e40c:       78 1b 62 7c     mr      r2,r3
> >>    14e410:       00 00 09 2c     cmpwi   r9,0
> >>    14e414:       4c 00 82 40     bne     14e460 <__sbrk+0x60>
> >>    14e418:       00 00 23 2c     cmpdi   r3,0
> >>    14e41c:       b0 00 82 40     bne     14e4cc <__sbrk+0xcc>
> >>    14e420:       a6 02 08 7c     mflr    r0
> >>    14e424:       40 00 01 f8     std     r0,64(r1)
> >>    14e428:       99 61 ff 4b     bl      1445c0 <brk>
> >>
> >> So there was a 64-bit instruction bundle at the patched offset, and that
> >> may have been the reason why ucx failed to patch properly.
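
To make the 64-bit bundle concrete: in Power ISA 3.1 a word whose
primary opcode (top 6 bits) is 1 is the prefix of an 8-byte instruction.
Comparing the two dumps, the word at +4 is unchanged and the overwritten
range starts at +8, i.e. on the suffix half. The Python sketch below
decodes the bytes from the objdump listing. The helper names are
hypothetical, the scan assumes it starts on an instruction boundary, and
the target calculation assumes the PC-relative form (R=1) seen here.

```python
import struct

# Bytes copied from the objdump listing, starting at offset 0 of __sbrk.
ORIGINAL = bytes.fromhex(
    "d1ff21f8"  # +0   stdu r1,-48(r1)
    "0e001006"  # +4   prefix word of plbz
    "f5ac2089"  # +8   suffix word of plbz
    "781b627c"  # +12  mr r2,r3
)

def words_le(code):
    """Split little-endian code bytes into 32-bit instruction words."""
    return list(struct.unpack("<%dI" % (len(code) // 4), code))

def is_prefix(word):
    """Primary opcode 1 marks the first half of a prefixed instruction."""
    return (word >> 26) == 1

def suffix_offsets(code):
    """Byte offsets that hold the second half of a prefixed instruction."""
    offsets = []
    words = words_le(code)
    i = 0
    while i < len(words):
        if is_prefix(words[i]) and i + 1 < len(words):
            offsets.append(4 * (i + 1))
            i += 2  # skip the suffix, it is not an instruction by itself
        else:
            i += 1
    return offsets

def pcrel_target(prefix, suffix, address):
    """Target of a PC-relative prefixed load: 18 high bits + 16 low bits."""
    disp = ((prefix & 0x3FFFF) << 16) | (suffix & 0xFFFF)
    if disp & (1 << 33):  # sign-extend the 34-bit displacement
        disp -= 1 << 34
    return address + disp

prefix, suffix = words_le(ORIGINAL)[1:3]
print(suffix_offsets(ORIGINAL))                   # offsets unsafe to patch at
print(hex(pcrel_target(prefix, suffix, 0x14e404)))
```

A patcher that treats every 4-byte word as a separate instruction would
need a check like suffix_offsets() before choosing where to start
overwriting.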
> > 
> > I agree
> >   
> >> I would very much prefer if there weren't any libraries like ucx in
> >> Fedora that patch glibc merely because you link against them.  It's fine
> >> to do this for debugging tools, but as part of regular execution, it
> >> risks too much breakage.
> > 
> > thanks, Florian, for the insight
> > 
> > IMO this issue is also causing the openmpi build failure in the
> > mass-rebuild
> > - https://koji.fedoraproject.org/koji/taskinfo?taskID=135241418
> > 
> > 
> >             Dan
> 
> I'm planning on dropping ucx support in openmpi on ppc64le:
> https://src.fedoraproject.org/rpms/openmpi/pull-request/24

ack, makes sense
 
> This should hopefully fix a lot of FTBFS issues with MPI-using packages.
> 
> Thank you very much for the detailed analysis - I would not have known 
> where to start.

right, it is a weird one ...


                Dan
-- 
_______________________________________________
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: 
https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
Do not reply to spam, report it: 
https://pagure.io/fedora-infrastructure/new_issue
