On 7/28/25 03:12, Dan Horák wrote:
On Mon, 28 Jul 2025 10:40:03 +0200
Florian Weimer <fwei...@redhat.com> wrote:
* Dan Horák:
On Sun, 27 Jul 2025 21:34:12 +0200
Sandro Mani <manisan...@gmail.com> wrote:
Hi
scotch is currently FTBFS on ppc64le (affects the current 7.0.7, the
previous 7.0.6, as well as the new 7.0.8 release), failing with [1]
gmake[2]: *** [src/libscotch/CMakeFiles/ptscotchf_h.dir/build.make:77: src/include/ptscotchf.h] Illegal instruction (core dumped)
7.0.6 previously successfully built with gcc-0:15.0.1-0.3.fc42.1.ppc64le and
now fails with gcc-0:15.1.1-5.fc43.1.ppc64le, so this looks like a gcc
regression.
Since this is on ppc64le and I have no access to such a machine, how can I
debug this?
you have access to a system from
https://fedoraproject.org/wiki/Test_Machine_Resources_For_Package_Maintainers
and the failure can be reproduced there. But it needs a Power10 system (same
as the current koji builders), not a Power9 one (the pre-DC-migration koji
builders; it builds OK there).
ppc64le-redhat-linux-gnu-openmpi/src/libscotch/ptdummysizes is the
crashing binary ...
and running it under gdb gives
...
Program received signal SIGILL, Illegal instruction.
0x00007ffff774e404 in sbrk () from /lib64/glibc-hwcaps/power10/libc.so.6
(gdb) where
#0 0x00007ffff774e404 in sbrk () from /lib64/glibc-hwcaps/power10/libc.so.6
#1 0x00007ffff787b38c in ucm_fire_mmap_events_internal () from /lib64/libucm.so.0
#2 0x00007ffff787bd88 in ucm_mmap_test_events_nolock () from /lib64/libucm.so.0
#3 0x00007ffff78818b8 in ucm_mmap_install () from /lib64/libucm.so.0
#4 0x00007ffff7881b30 in ucm_mmap_init () from /lib64/libucm.so.0
#5 0x00007ffff7881c2c in ucm_library_init () from /lib64/libucm.so.0
#6 0x00007ffff7881cbc in ucm_set_global_opts () from /lib64/libucm.so.0
#7 0x00007ffff725745c in ucs_init_ucm_opts () from /lib64/libucs.so.0
#8 0x00007ffff7243fb0 in ucs_init () from /lib64/libucs.so.0
#9 0x00007ffff7f989bc in call_init (l=<optimized out>, argc=1, argv=0x7fffffffece8, env=0x7fffffffecf8) at dl-init.c:74
#10 _dl_init (main_map=0x7ffff7ff12f0, argc=1, argv=0x7fffffffece8, env=0x7fffffffecf8) at dl-init.c:121
#11 0x00007ffff7fc3eb8 in _dl_start_user () from /lib64/ld64.so.2
The location of the crash:
Dump of assembler code for function __GI___sbrk:
0x00007ffff774e400 <+0>: d1 ff 21 f8 stdu r1,-48(r1)
=> 0x00007ffff774e404 <+4>: 0e 00 10 06 .long 0x610000e
0x00007ffff774e408 <+8>: 00 00 60 3d lis r11,0
0x00007ffff774e40c <+12>: ff 7f 6b 61 ori r11,r11,32767
0x00007ffff774e410 <+16>: c7 07 6b 79 sldi. r11,r11,32
0x00007ffff774e414 <+20>: 87 f7 6b 65 oris r11,r11,63367
0x00007ffff774e418 <+24>: b8 9e 6b 61 ori r11,r11,40632
0x00007ffff774e41c <+28>: a6 03 69 7d mtctr r11
0x00007ffff774e420 <+32>: 20 04 80 4e bctr
0x00007ffff774e424 <+36>: 40 00 01 f8 std r0,64(r1)
0x00007ffff774e428 <+40>: 99 61 ff 4b bl 0x7ffff77445c0 <__brk>
This was patched by the ucx library.
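As a cross-check of the patched sequence above, here is a minimal Python sketch (not taken from ucx or glibc; the helper name is hypothetical) that replays the lis/ori/sldi./oris/ori instructions with the immediates from the disassembly to see where the final bctr jumps:

```python
MASK64 = (1 << 64) - 1  # r11 is a 64-bit register


def replay_trampoline():
    """Replay the patched instructions at <+8>..<+24> and return r11."""
    r11 = (0 << 16) & MASK64        # lis   r11,0
    r11 |= 32767                    # ori   r11,r11,32767
    r11 = (r11 << 32) & MASK64      # sldi. r11,r11,32
    r11 |= 63367 << 16              # oris  r11,r11,63367
    r11 |= 40632                    # ori   r11,r11,40632
    return r11                      # mtctr r11 ; bctr jumps here


if __name__ == "__main__":
    print(hex(replay_trampoline()))  # 0x7ffff7879eb8
```

The resulting address is close to the libucm.so.0 frames in the backtrace (for example 0x00007ffff787b38c), which is consistent with the sequence being a jump into libucm.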
The original looks like this:
000000000014e400 <__sbrk>:
14e400: d1 ff 21 f8 stdu r1,-48(r1)
14e404: 0e 00 10 06 plbz r9,961781 # 2390f9 <__libc_initial>
14e408: f5 ac 20 89
14e40c: 78 1b 62 7c mr r2,r3
14e410: 00 00 09 2c cmpwi r9,0
14e414: 4c 00 82 40 bne 14e460 <__sbrk+0x60>
14e418: 00 00 23 2c cmpdi r3,0
14e41c: b0 00 82 40 bne 14e4cc <__sbrk+0xcc>
14e420: a6 02 08 7c mflr r0
14e424: 40 00 01 f8 std r0,64(r1)
14e428: 99 61 ff 4b bl 1445c0 <brk>
So there was a 64-bit instruction bundle at the patched offset, and that
may have been the reason why ucx failed to patch properly.
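To illustrate the point, here is a small Python sketch (my own helper names, all hypothetical; the field layout follows my reading of the Power ISA 3.1 prefixed-instruction encoding) that decodes the bytes from the two dumps. It shows that the word at +4 is the prefix of an 8-byte instruction, that prefix plus the original suffix decode to the plbz shown by objdump, and that in the patched copy the prefix is still present while its suffix at +8 has been replaced by lis:

```python
import struct


def word_le(hex_bytes):
    """Turn dump bytes like '0e 00 10 06' into a 32-bit little-endian word."""
    return struct.unpack("<I", bytes.fromhex(hex_bytes))[0]


def primary_opcode(word):
    """Top 6 bits of an instruction word."""
    return word >> 26


def is_prefix(word):
    """Primary opcode 1 marks the first half of an 8-byte prefixed instruction."""
    return primary_opcode(word) == 1


def decode_pcrel_load(addr, prefix, suffix):
    """Decode a PC-relative MLS:D-form load (such as plbz) into (rt, target)."""
    if not is_prefix(prefix):
        raise ValueError("not a prefix word")
    if (prefix >> 24) & 0x3 != 2 or (prefix >> 20) & 0x1 != 1:
        raise ValueError("not a PC-relative MLS-type prefix")
    disp = ((prefix & 0x3FFFF) << 16) | (suffix & 0xFFFF)  # 34-bit displacement
    if disp & (1 << 33):                                   # sign-extend
        disp -= 1 << 34
    rt = (suffix >> 21) & 0x1F
    return rt, addr + disp


if __name__ == "__main__":
    prefix = word_le("0e 00 10 06")        # word at +4 in both dumps
    orig_suffix = word_le("f5 ac 20 89")   # word at +8 in the original libc
    new_suffix = word_le("00 00 60 3d")    # word at +8 after patching (lis r11,0)

    print(is_prefix(prefix))               # True
    rt, target = decode_pcrel_load(0x14E404, prefix, orig_suffix)
    print(rt, hex(target))                 # 9 0x2390f9
    print(primary_opcode(orig_suffix))     # 34 (lbz)
    print(primary_opcode(new_suffix))      # 15 (addis/lis)
```

With the prefix left at +4 and lis written at +8, the CPU presumably tries to execute the two words as one prefixed instruction, which is not a valid pairing and would match the SIGILL reported at +4.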
I agree
I would very much prefer if there weren't any libraries like ucx in
Fedora that patch glibc merely because you link against them. It's fine
to do this for debugging tools, but as part of regular execution, it
risks too much breakage.
thanks, Florian, for the insight
IMO this issue is also causing the openmpi build failure in the
mass-rebuild
- https://koji.fedoraproject.org/koji/taskinfo?taskID=135241418
Dan
I'm planning on dropping ucx support in openmpi on ppc64le:
https://src.fedoraproject.org/rpms/openmpi/pull-request/24
This should hopefully fix a lot of FTBFS issues with packages that use MPI.
Thank you very much for the detailed analysis - I would not have known
where to start.
Orion
--
Orion Poplawski
he/him/his - surely the least important thing about me
IT Systems Manager 720-772-5637
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane or...@nwra.com
Boulder, CO 80301 https://www.nwra.com/