Hello Kurt Deschler, Zoltan Martonka, Attila Bukor, Kudu Jenkins, Abhishek
Chennaka,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/23925
to look at the new patch set (#2).
Change subject: KUDU-3736 fix SIGSEGV in codegen with libgcc-11.5.0-10+
......................................................................
KUDU-3736 fix SIGSEGV in codegen with libgcc-11.5.0-10+
Earlier libgcc versions used a mutex-protected linked list to store
information on EH frames. To address scalability bottlenecks of that
implementation of the __{register,deregister}_frame() API, the newer
implementation found in unwind-dw2-fde.c [1] and related headers has
switched to a read-optimized b-tree [2]. The b-tree uses the start
address of the memory range as the key.
While the lock and the related bottleneck are now gone, the b-tree-based
implementation relies heavily on a few invariants regarding the
properties of the frames/sections being (de)registered. In particular,
there is now the range invariant (a.k.a. "the non-overlapping rule"),
which the new implementation depends on when updating and rebalancing
the b-tree upon insertion/deletion of elements (at least, this rule is
referred to in the commit description at [3]):
* There must not be two frames that have overlapping address ranges
The old linked-list implementation might have tolerated violations of
this rule, but the new b-tree implementation is susceptible to
undefined behavior (UB) in such cases because the tree logic assumes
the presence of clear range boundaries for its separators.
As it turns out, this applies not only to any pair of individual frame
description entries (FDEs), but also to the whole range span of the FDEs
that follow the corresponding CIE in the data structure supplied to
__{register,deregister}_frame() invocations. In particular, in
classify_object_over_fdes() the span is calculated by going over each
FDE, so get_pc_range() returns the range spanning from
min(beginning of all FDEs) up to max(end of all FDEs).
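
The span computation described above can be sketched roughly as
follows. This is only an illustration of the min/max logic, not
libgcc's actual code: the FdeRange struct and the SpanOfFdes() helper
are hypothetical names introduced for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the code range covered by one FDE,
// expressed as the half-open interval [begin, end).
struct FdeRange {
  uintptr_t begin;
  uintptr_t end;
};

// Returns the single range that covers every FDE in the list:
// [min(begin of all FDEs), max(end of all FDEs)).
// The list must not be empty.
inline FdeRange SpanOfFdes(const std::vector<FdeRange>& fdes) {
  assert(!fdes.empty());
  FdeRange span = fdes.front();
  for (const FdeRange& f : fdes) {
    span.begin = std::min(span.begin, f.begin);
    span.end = std::max(span.end, f.end);
  }
  return span;
}
```

Note that the resulting span also covers any gaps between the
individual FDE ranges, which is why memory belonging to a different
object can fall inside it.
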
In turn, the implementation of RuntimeDyld's SectionMemoryManager
in LLVM uses mmap() with MAP_ANON and MAP_PRIVATE flags to allocate
memory for jitted object sections, including the .eh_frame section.
That results in quite arbitrary placement of the allocated memory
ranges: the application controls the placement only by providing an
'address hint', and if the kernel's first attempt to establish the new
mapping at the closest memory page boundary fails because there is an
existing memory mapping at that address already, the kernel is free to
pick _any_ range it deems appropriate for the requested size [4].
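
The hint behavior can be observed with a small POSIX-only sketch. The
MapAnonWithHint() helper is a hypothetical name used for this example
and is not part of LLVM or Kudu; it only mirrors the flags mentioned
above.

```cpp
#include <sys/mman.h>
#include <unistd.h>

#include <cstddef>

// Hypothetical helper: maps `len` bytes of anonymous private memory,
// passing `hint` as the preferred address. MAP_FIXED is deliberately
// not used, so the kernel treats `hint` as a suggestion only.
// Returns MAP_FAILED on error.
inline void* MapAnonWithHint(void* hint, size_t len) {
  return mmap(hint, len, PROT_READ | PROT_WRITE,
              MAP_PRIVATE | MAP_ANON, -1, 0);
}
```

If the hint points at an address that is already mapped, the call
still succeeds but returns some other address chosen by the kernel, so
the caller cannot rely on where the new range ends up.
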
Since the address space of a running process might become fragmented
when there are many jitted code references alive (their handles are kept
in the codegen's cache in addition to the references kept by scanners,
compaction operations, etc.), it's possible to end up with a memory
layout like the one illustrated below:
objA sections:        [.....][...][...]
objB sections: [.....]                 [...][...]
The .eh_frame contents for the section layout above wouldn't comply
with the "non-overlapping rule" for the FDEs span, and the new libgcc
implementation could end up triggering SIGSEGV while attempting to
register the .eh_frame section for one of the objects.
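
A minimal sketch of the overlap condition, assuming a layout where the
sections of one object land on both sides of another object's
sections. AddrSpan and SpansOverlap() are hypothetical names for this
example, and the addresses in the usage note are made up.

```cpp
#include <cstdint>

// Hypothetical half-open address interval [begin, end).
struct AddrSpan {
  uintptr_t begin;
  uintptr_t end;
};

// Two half-open intervals overlap when each one starts before the
// other one ends.
inline bool SpansOverlap(const AddrSpan& a, const AddrSpan& b) {
  return a.begin < b.end && b.begin < a.end;
}
```

For instance, if objB's sections occupy [0x1000, 0x2000) and
[0x6000, 0x8000) while objA occupies [0x2000, 0x5000), no two
individual sections overlap, yet objB's span [0x1000, 0x8000) overlaps
objA's span.
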
The situation described above manifests itself as Kudu tablet servers
crashing with SIGSEGV on RHEL9 with libgcc version 11.5.0-11. The
updated libgcc 11.5.0-11 package came with the updates pushed into
Red Hat's package repos along with the release of RHEL9.7 in
November 2025. From the libgcc package changelog [5] it's clear that
Red Hat's libgcc has used the b-tree EH frame implementation since
11.5.0-10, and most likely it's here to stay.
This patch addresses the issue of Kudu tablet servers crashing in
codegen by providing a custom implementation of the section memory
manager (JITFrameManager class) to be used by LLVM's MCJIT execution
engine. In essence, it reserves a memory area for subsequent
allocations of codegenned objects' sections, so they are all adjacent
and localized in the pre-allocated memory area. This approach
guarantees that the FDEs sent to libgcc's __register_frame(), and the
spans of their ranges, cannot interleave with any other FDEs and spans
registered by a Kudu tablet server during its life cycle.
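
The general idea of reserving one area and carving all sections out of
it can be sketched as below. This is a simplified illustration under
stated assumptions, not the JITFrameManager from this patch: the
ReservedArena class and its methods are hypothetical, it maps the
whole area read-write up front, and it never reuses freed space.

```cpp
#include <sys/mman.h>

#include <cstddef>
#include <cstdint>

// Hypothetical bump allocator over a single pre-mapped region.
class ReservedArena {
 public:
  // Maps `capacity` bytes once; every later allocation comes from
  // this contiguous range.
  explicit ReservedArena(size_t capacity) : capacity_(capacity) {
    void* p = mmap(nullptr, capacity_, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANON, -1, 0);
    base_ = (p == MAP_FAILED) ? nullptr : static_cast<uint8_t*>(p);
  }
  ~ReservedArena() {
    if (base_ != nullptr) munmap(base_, capacity_);
  }
  ReservedArena(const ReservedArena&) = delete;
  ReservedArena& operator=(const ReservedArena&) = delete;

  // Returns `size` bytes whose offset from the page-aligned base is a
  // multiple of `align` (a power of two no larger than the page
  // size), or nullptr if the request is invalid or does not fit.
  void* Allocate(size_t size, size_t align) {
    if (base_ == nullptr) return nullptr;
    if (align == 0 || (align & (align - 1)) != 0) return nullptr;
    const size_t start = (used_ + align - 1) & ~(align - 1);
    if (start > capacity_ || size > capacity_ - start) return nullptr;
    used_ = start + size;
    return base_ + start;
  }

  // True if `p` points into the reserved area.
  bool Contains(const void* p) const {
    const uint8_t* q = static_cast<const uint8_t*>(p);
    return base_ != nullptr && q >= base_ && q < base_ + capacity_;
  }

 private:
  size_t capacity_;
  uint8_t* base_ = nullptr;
  size_t used_ = 0;
};
```

Because every block handed out lies inside the one reserved range,
sections allocated this way cannot end up scattered between unrelated
mappings in the process address space.
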
[1] https://github.com/gcc-mirror/gcc/blob/1c0305d7aea53d788f3f74ca9a2bd9fb764c0109/libgcc/unwind-dw2-fde.c
[2] https://github.com/gcc-mirror/gcc/blob/1c0305d7aea53d788f3f74ca9a2bd9fb764c0109/libgcc/unwind-dw2-btree.h
[3] https://gcc.gnu.org/cgit/gcc/commit/libgcc/unwind-dw2-btree.h?id=21109b37e8585a7a1b27650fcbf1749380016108
[4] https://www.man7.org/linux/man-pages/man2/mmap.2.html
[5] https://rhel.pkgs.org/9/red-hat-ubi-baseos-x86_64/libgcc-11.5.0-11.el9.x86_64.rpm.html
Change-Id: I691d2f442c3148f235847c4c8e56767577804b1a
---
M src/kudu/codegen/jit_frame_manager.cc
M src/kudu/codegen/jit_frame_manager.h
M src/kudu/codegen/module_builder.cc
M thirdparty/download-thirdparty.sh
A thirdparty/patches/llvm-section-mm-extra-methods.patch
A thirdparty/patches/llvm-section-mm-memory-mapper.patch
6 files changed, 439 insertions(+), 58 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/25/23925/2
--
To view, visit http://gerrit.cloudera.org:8080/23925
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I691d2f442c3148f235847c4c8e56767577804b1a
Gerrit-Change-Number: 23925
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <[email protected]>
Gerrit-Reviewer: Abhishek Chennaka <[email protected]>
Gerrit-Reviewer: Attila Bukor <[email protected]>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Kurt Deschler <[email protected]>
Gerrit-Reviewer: Zoltan Martonka <[email protected]>