Some thoughts from a colleague:
When dl_lookup_symbol gets called, it doesn't really know what it's getting
into--it doesn't know what processor model the stack frames around it were
built with, it doesn't necessarily even know what registers the processor has.
So there's a rule that ld.so/rtld et al. shouldn't touch the xmm registers or
any of the other registers beyond the base set. There's a glibc test,
tst-xmmymm.sh, which checks this. There's also a recent bug
report, https://sourceware.org/bugzilla/show_bug.cgi?id=15627, that gcc-4.8 can
vectorize such things as memset (with -O3 or -ftree-vectorize), so rtld needed
to add its own assembler version of memset that doesn't touch the SSE
registers. It looks like this is basically the same thing; presumably
tst-xmmymm.sh would fail if we ran it against ld.so linked with jemalloc built
with gcc 4.8.
glibc contains tst-xmmymm.sh. Its initial commit message states:
Make sure no code in ld.so uses xmm/ymm registers on x86-64.
This patch introduces a test to make sure no function modifies the
xmm/ymm registers. With the exception of the auditing functions.
Looks like jemalloc breaks the rule by using SSE registers.
david
On Nov 9, 2013, at 5:30 PM, David Abdurachmanov wrote:
> Hi,
>
> I am having problems with jemalloc 3.4.1 (currently we use 2.2.2 in
> production). I found that with jemalloc 3.4.1 a function's first argument can
> be changed if that argument is passed in the XMM0 register. Compiled with GCC
> 4.8.1 (also tested with 4.8.2). No problems on Scientific Linux 6
> (RHEL6-based), but it fails on Scientific Linux 5 (RHEL5-based). All of this
> is because _dl_lookup_symbol_x calls _realloc_ in Scientific Linux 5.
>
> This probably makes jemalloc 3.4.1 and the whole 3.X.Y series not recommended
> for RHEL5 and RHEL5-based distributions.
>
> Original email below.
>
> - - - - - - -
>
> My initial investigations were done on slc6_amd64_gcc481 and the release is
> available for slc5_amd64_gcc481.
>
> Most of the workflows will fail on this [slc5_amd64_gcc481] architecture,
> while on slc6_amd64_gcc481 all workflows pass.
>
> If you are interested in the cause and in calling conventions, continue
> reading.
>
> Most workflows fail with:
>
> ----- Begin Fatal Exception 08-Nov-2013 14:19:25 CET-----------------------
> An exception of category 'InvalidIntervalError' occurred while
> [0] Processing run: 208307 lumi: 1 event: 643482
> [1] Running path 'reconstruction_step'
> [2] Calling event method for module
> TrackIPProducer/'impactParameterTagInfos'
> Exception Message:
> Upper boundary below lower boundary in histogram integral.
> ----- End Fatal Exception -------------------------------------------------
>
> Code triggering exception
> (CondFormats/PhysicsToolsObjects/interface/Histogram.icc):
>
> 244 template<typename Value_t, typename Axis_t>
> 245 Value_t Histogram<Value_t, Axis_t>::integral(Axis_t hBound, Axis_t lBound,
> 246 int mode) const
> 247 {
> 248 if (hBound < lBound)
> 249 throw cms::Exception("InvalidIntervalError")
> 250 << "Upper boundary below lower boundary in "
> 251 << "histogram integral." << std::endl;
>
> The problem by example (description below):
>
> Dump of assembler code for function
> PhysicsTools::Calibration::Histogram<float, float>::normalizedIntegral(float,
> float, int) const:
> 0x00002aaabc67ceb0 <+0>: push %rbx
> 0x00002aaabc67ceb1 <+1>: mov %rdi,%rbx
> 0x00002aaabc67ceb4 <+4>: sub $0x10,%rsp
> 0x00002aaabc67ceb8 <+8>: callq 0x2aaabc6331e0
> <_ZNK12PhysicsTools11Calibration9HistogramIffE8integralEffi@plt>
> 0x00002aaabc67cebd <+13>: mov %rbx,%rdi
> 0x00002aaabc67cec0 <+16>: movss %xmm0,0xc(%rsp)
> 0x00002aaabc67cec6 <+22>: callq 0x2aaabc632c80
> <_ZNK12PhysicsTools11Calibration9HistogramIffE13normalizationEv@plt>
> 0x00002aaabc67cecb <+27>: movss 0xc(%rsp),%xmm1
> 0x00002aaabc67ced1 <+33>: add $0x10,%rsp
> 0x00002aaabc67ced5 <+37>: divss %xmm0,%xmm1
> 0x00002aaabc67ced9 <+41>: pop %rbx
> 0x00002aaabc67ceda <+42>: movaps %xmm1,%xmm0
> 0x00002aaabc67cedd <+45>: retq
> End of assembler dump.
> this = 0x2aab170a9ff0
> hBound = 57.6329994
> lBound = 0
> mode = 1
>
> Breakpoint 1, PhysicsTools::Calibration::Histogram<float, float>::integral
> (this=0x2aab170a9ff0, hBound=-2.23135843e-10, lBound=0, mode=1)
> at
> /build/davidlt/CMSSW_7_0_0_pre8_jemalloc341/src/CondFormats/PhysicsToolsObjects/interface/Histogram.icc:245
> 245 Value_t Histogram<Value_t, Axis_t>::integral(Axis_t hBound, Axis_t
> lBound,
> 1: x/i $pc
> => 0x2aaabc67cbdc <PhysicsTools::Calibration::Histogram<float,
> float>::integral(float, float, int) const>: push %r14
> this = 0x2aab170a9ff0
> hBound = -2.23135843e-10
> lBound = 0
> mode = 1
>
> KA-BOOM!
>
> _normalizedIntegral_ calls _integral_ with IDENTICAL arguments, yet by the
> time we reach the body of _integral_ our _hBound_ has a different value.
>
> We call _integral_ via the PLT, so the dynamic linker
> (/lib64/ld-linux-x86-64.so.2) first has to resolve the symbol. The value is
> modified between the two functions, while the symbol is being resolved.
>
> That happens in _dl_lookup_symbol_x (/lib64/ld-linux-x86-64.so.2): on SLC5
> it calls _realloc_, while on SLC6 it calls _malloc_. This change in the
> dynamic linker/loader is the reason why it works fine under SLC6.
>
> _hBound_ is stored in $xmm0.v4_float[0]. It so happens that in _realloc_
> (jemalloc), for this line (src/jemalloc.c):
>
> 1244 ta->allocated += usize;
>
> the compiler generates SSE-based code (using $xmm0):
>
> 0x00002aaaad381666 <+630>: mov %r12,0x28(%rsp)
> 0x00002aaaad38166b <+635>: movq 0x28(%rsp),%xmm0
> 0x00002aaaad381671 <+641>: movhps 0x20(%rsp),%xmm0
> 0x00002aaaad381676 <+646>: paddq (%rax),%xmm0
> 0x00002aaaad38167a <+650>: movdqa %xmm0,(%rax)
> 0x00002aaaad38167e <+654>: add $0x38,%rsp
>
> Just a few instructions, and they modify the value of _hBound_.
>
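A rough sketch of what that instruction sequence appears to do, written with SSE2 intrinsics: two 64-bit increments are packed into one 128-bit register and added to two adjacent counters in a single paddq. The type counters_t and the function names are hypothetical stand-ins, not jemalloc code, and the pairing of the two lanes with the allocated/deallocated counters is an inference from the disassembly. On a build without SSE2 the sketch falls back to scalar adds.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical stand-in for jemalloc's per-thread counters: two adjacent
 * 64-bit fields, which is what makes a single 128-bit add possible. */
typedef struct {
    uint64_t allocated;
    uint64_t deallocated;
} counters_t;

/* Update both counters. With SSE2 this mirrors the movq/movhps/paddq/store
 * pattern and therefore overwrites an xmm register; otherwise it uses
 * plain scalar adds. */
static void counters_add(counters_t *ta, uint64_t usize, uint64_t old_size)
{
#if defined(__SSE2__)
    /* _mm_set_epi64x takes (high lane, low lane). */
    __m128i inc = _mm_set_epi64x((long long)old_size, (long long)usize);
    /* Unaligned load/store, so the sketch does not depend on alignment. */
    __m128i cur = _mm_loadu_si128((const __m128i *)ta);
    _mm_storeu_si128((__m128i *)ta, _mm_add_epi64(cur, inc));
#else
    ta->allocated += usize;
    ta->deallocated += old_size;
#endif
}

/* Small self-check with made-up numbers. */
static void counters_demo(void)
{
    counters_t c = { 100, 10 };
    counters_add(&c, 48, 16);
    assert(c.allocated == 148);
    assert(c.deallocated == 26);
}
```

Either path gives the same counter values; the only difference is that the vector path leaves its operands in an xmm register, which is harmless under the normal calling convention but not inside the lazy symbol resolution path.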
> Old value = 57.6329994
> New value = 6.72623263e-44
> 0x00002aaaad381671 in realloc (ptr=<optimized out>, size=<optimized out>) at
> src/jemalloc.c:1244
> 1244 src/jemalloc.c: No such file or directory.
> 1: x/i $pc
> => 0x2aaaad381671 <realloc+641>: movhps 0x20(%rsp),%xmm0
> Continuing.
> Watchpoint 7: $xmm0.v4_float[0]
>
> Old value = 6.72623263e-44
> New value = -2.22548424e-10
> 0x00002aaaad38167a in realloc (ptr=<optimized out>, size=<optimized out>) at
> src/jemalloc.c:1244
> 1244 in src/jemalloc.c
> 1: x/i $pc
> => 0x2aaaad38167a <realloc+650>: movdqa %xmm0,(%rax)
> Continuing.
>
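The odd intermediate value in the watchpoint output is what an integer looks like when gdb prints the low 32 bits of %xmm0 as a float. A minimal sketch, assuming a 32-bit IEEE-754 float: the printed 6.72623263e-44 has the bit pattern 0x30, i.e. the integer 48. That number is inferred from the printed float and is not stated in the original message; which variable it belonged to is not known from the dump. The helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* What gdb would show as $xmm0.v4_float[0] if the low 64 bits of %xmm0 held
 * the integer v: the low 32 bits of v, read as an IEEE-754 single. */
static float low_lane_as_float(uint64_t v)
{
    uint32_t low = (uint32_t)v;   /* low 32 bits, independent of endianness */
    float f;
    memcpy(&f, &low, sizeof f);   /* well-defined way to reinterpret bits */
    return f;
}

/* Inverse view: the raw bit pattern of a float. */
static uint32_t float_bits(float f)
{
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}

/* Self-check: the value from the watchpoint output is the integer 48
 * seen through a float-typed window. */
static void float_view_demo(void)
{
    assert(float_bits(6.72623263e-44f) == 48u);
    assert(float_bits(low_lane_as_float(48)) == 48u);
}
```

So the first watchpoint hit is the movq loading a small integer over hBound, and the second is the paddq adding the old counter value on top of it.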
> Now look at "Calling conventions for different C++ compilers and
> operating systems" (I assume it applies to C as well, since the two are
> compatible).
>
> 64-bit Linux. Callee-saved registers: RBX, RBP, R12-R15. All fine in jemalloc
> _realloc_:
>
> Dump of assembler code for function realloc:
> 0x00002aaaad3803f0 <+0>: push %r15
> 0x00002aaaad3803f2 <+2>: push %r14
> 0x00002aaaad3803f4 <+4>: push %r13
> 0x00002aaaad3803f6 <+6>: push %r12
> 0x00002aaaad3803f8 <+8>: push %rbp
> 0x00002aaaad3803f9 <+9>: mov %rsi,%rbp
> 0x00002aaaad3803fc <+12>: push %rbx
>
> But all other registers are scratch registers.
>
> Also, from the "System V Application Binary Interface AMD64 Architecture
> Processor Supplement" (October 7, 2013), section 3.2.1:
>
> Registers %rbp, %rbx and %r12 through %r15 "belong" to the calling function
> and the called function is required to preserve their values. In other words,
> a called function must preserve these registers' values for its caller.
> Remaining registers "belong" to the called function. If a calling function
> wants to preserve such a register value across a function call, it must save
> the value in its local stack frame.
>
> Simply put, according to this, the /lib64/ld-linux-x86-64.so.2 dynamic
> linker/loader (_dl_lookup_symbol_x) had to save the value of the xmm0
> register itself before calling _realloc_.
>
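The register rule quoted above can be written down as a small lookup. This is only an illustration of the passage as quoted: it lists the six general-purpose registers named there and treats everything else, including all xmm registers, as belonging to the called function. The function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Registers the quoted ABI passage says "belong" to the caller, i.e. the
 * ones a called function must preserve. */
static const char *const callee_saved[] = {
    "rbx", "rbp", "r12", "r13", "r14", "r15"
};

/* Return true if the callee must preserve `reg` according to that list. */
static bool is_callee_saved(const char *reg)
{
    for (size_t i = 0; i < sizeof callee_saved / sizeof callee_saved[0]; i++) {
        if (strcmp(reg, callee_saved[i]) == 0)
            return true;
    }
    return false;
}
```

By this list is_callee_saved("xmm0") is false, which is why realloc is within its rights to overwrite xmm0 when it is called as an ordinary function.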
> You cannot compile jemalloc without SSE:
>
> include/jemalloc/internal/prof.h:349:40: error: SSE register return with SSE
> disabled
>
> If we cannot stop jemalloc from using SSE registers, how can we work around
> the problem?
>
> 1240 if (config_stats && ret != NULL) {
> 1241 thread_allocated_t *ta;
> 1242 assert(usize == isalloc(ret, config_prof));
> 1243 ta = thread_allocated_tsd_get();
> 1244 ta->allocated += usize;
> 1245 ta->deallocated += old_size;
> 1246 }
>
> In _realloc_, line 1244 is wrapped in an if on config_stats. Compiling
> jemalloc with the --disable-stats option disables statistics collection, which
> should also slightly increase performance.
>
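A minimal sketch of why a configure-time switch can remove the offending update altogether, assuming config_stats is a compile-time constant boolean. The macro EXAMPLE_ENABLE_STATS, the counter type and record_realloc are hypothetical names for this illustration, not jemalloc's own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Compile-time switch. EXAMPLE_ENABLE_STATS is a made-up macro that a
 * configure script might define when statistics are wanted. */
static const bool config_stats =
#ifdef EXAMPLE_ENABLE_STATS
    true
#else
    false
#endif
    ;

typedef struct {
    uint64_t allocated;
    uint64_t deallocated;
} thread_counters_t;

static thread_counters_t counters;

/* Stand-in for the tail of realloc(): bookkeeping runs only with stats on. */
static void record_realloc(void *ret, size_t usize, size_t old_size)
{
    if (config_stats && ret != NULL) {
        counters.allocated += usize;
        counters.deallocated += old_size;
    }
}
```

When the constant is false, an optimizing compiler can drop the whole branch, so the two counter updates (and whatever vector code was generated for them) disappear from the function. That removes this particular use of xmm0; it does not by itself guarantee that nothing else in the allocator touches the SSE registers.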
> It's a bit worrisome that arguments can change in between function calls.
>
> david