Some thoughts from a colleague:

When dl_lookup_symbol gets called, it doesn't really know what it's getting 
into--it doesn't know what processor model the stack frames around it were 
built with, it doesn't necessarily even know what registers the processor has. 
So there's a rule that ld.so/rtld et al. shouldn't touch the xmm registers or 
any of the other registers beyond the base set.  There's a glibc test, 
tst-xmmymm.sh, which checks this.  There's also a recent bug 
report, https://sourceware.org/bugzilla/show_bug.cgi?id=15627 , that gcc-4.8 can 
vectorize such things as memset (with -O3 or -ftree-vectorize), so rtld needed 
to add its own assembler version of memset that doesn't touch the SSE 
registers.  It looks like this is basically the same thing; presumably 
tst-xmmymm.sh would fail if we ran it against ld.so linked with jemalloc built 
with gcc 4.8.

glibc contains tst-xmmymm.sh.

Its initial commit message states:

    Make sure no code in ld.so uses xmm/ymm registers on x86-64.

    This patch introduces a test to make sure no function modifies the
    xmm/ymm registers.  With the exception of the auditing functions.

Looks like jemalloc breaks the rule by using SSE registers.

david

On Nov 9, 2013, at 5:30 PM, David Abdurachmanov wrote:

> Hi,
> 
> I am having problems with jemalloc 3.4.1 (currently we use 2.2.2 in 
> production). I found that with jemalloc 3.4.1 a function's first argument 
> can be changed if that argument is passed in the XMM0 register. Compiled 
> with GCC 4.8.1 (also tested with 4.8.2). No problems on Scientific Linux 6 
> (RHEL6-based), but it fails on Scientific Linux 5 (RHEL5-based). All of this 
> is because _dl_lookup_symbol_x calls _realloc_ on Scientific Linux 5.
> 
> This probably makes jemalloc 3.4.1 and the whole 3.X.Y series not recommended 
> for RHEL5 and RHEL5-based distributions.
> 
> Original email below.
> 
> - - - - - - -
> 
> My initial investigations were done on slc6_amd64_gcc481 and the release is 
> available for slc5_amd64_gcc481.
> 
> Most of the workflows will fail on this [slc5_amd64_gcc481] architecture, 
> while on slc6_amd64_gcc481 all workflows pass.
> 
> If you are interested in the cause and the calling conventions, continue 
> reading.
> 
> Most workflows fail with:
> 
> ----- Begin Fatal Exception 08-Nov-2013 14:19:25 CET-----------------------
> An exception of category 'InvalidIntervalError' occurred while
>   [0] Processing run: 208307 lumi: 1 event: 643482
>   [1] Running path 'reconstruction_step'
>   [2] Calling event method for module 
> TrackIPProducer/'impactParameterTagInfos'
> Exception Message:
> Upper boundary below lower boundary in histogram integral.
> ----- End Fatal Exception -------------------------------------------------
> 
> Code triggering the exception 
> (CondFormats/PhysicsToolsObjects/interface/Histogram.icc):
> 
> 244 template<typename Value_t, typename Axis_t>
> 245 Value_t Histogram<Value_t, Axis_t>::integral(Axis_t hBound, Axis_t lBound,
> 246                                              int mode) const
> 247 {
> 248         if (hBound < lBound)
> 249                 throw cms::Exception("InvalidIntervalError")
> 250                         << "Upper boundary below lower boundary in "
> 251                         << "histogram integral." << std::endl;
> 
> The problem by example (description below):
> 
> Dump of assembler code for function 
> PhysicsTools::Calibration::Histogram<float, float>::normalizedIntegral(float, 
> float, int) const:
>   0x00002aaabc67ceb0 <+0>:     push   %rbx
>   0x00002aaabc67ceb1 <+1>:     mov    %rdi,%rbx
>   0x00002aaabc67ceb4 <+4>:     sub    $0x10,%rsp
>   0x00002aaabc67ceb8 <+8>:     callq  0x2aaabc6331e0 
> <_ZNK12PhysicsTools11Calibration9HistogramIffE8integralEffi@plt>
>   0x00002aaabc67cebd <+13>:    mov    %rbx,%rdi
>   0x00002aaabc67cec0 <+16>:    movss  %xmm0,0xc(%rsp)
>   0x00002aaabc67cec6 <+22>:    callq  0x2aaabc632c80 
> <_ZNK12PhysicsTools11Calibration9HistogramIffE13normalizationEv@plt>
>   0x00002aaabc67cecb <+27>:    movss  0xc(%rsp),%xmm1
>   0x00002aaabc67ced1 <+33>:    add    $0x10,%rsp
>   0x00002aaabc67ced5 <+37>:    divss  %xmm0,%xmm1
>   0x00002aaabc67ced9 <+41>:    pop    %rbx
>   0x00002aaabc67ceda <+42>:    movaps %xmm1,%xmm0
>   0x00002aaabc67cedd <+45>:    retq   
> End of assembler dump.
> this = 0x2aab170a9ff0
> hBound = 57.6329994
> lBound = 0
> mode = 1
> 
> Breakpoint 1, PhysicsTools::Calibration::Histogram<float, float>::integral 
> (this=0x2aab170a9ff0, hBound=-2.23135843e-10, lBound=0, mode=1)
>    at 
> /build/davidlt/CMSSW_7_0_0_pre8_jemalloc341/src/CondFormats/PhysicsToolsObjects/interface/Histogram.icc:245
> 245     Value_t Histogram<Value_t, Axis_t>::integral(Axis_t hBound, Axis_t 
> lBound,
> 1: x/i $pc
> => 0x2aaabc67cbdc <PhysicsTools::Calibration::Histogram<float, 
> float>::integral(float, float, int) const>:      push   %r14
> this = 0x2aab170a9ff0
> hBound = -2.23135843e-10
> lBound = 0
> mode = 1
> 
> KA-BOOM! 
> 
> _normalizedIntegral_ calls _integral_ with IDENTICAL arguments, yet once we 
> reach the body of _integral_ our _hBound_ has changed to a different value.
> 
> We call _integral_ via the PLT, so the dynamic linker 
> (/lib64/ld-linux-x86-64.so.2) first has to resolve the symbol. The value is 
> modified between these two functions, while the symbol is being resolved.
> 
> That happens in _dl_lookup_symbol_x (/lib64/ld-linux-x86-64.so.2): on SLC5 
> it calls _realloc_, while on SLC6 it calls _malloc_. This change in the 
> dynamic linker/loader is the reason why it works fine under SLC6.
> 
> _hBound_ is stored in $xmm0.v4_float[0]. It so happens that for this line in 
> jemalloc's _realloc_ (src/jemalloc.c):
> 
> 1244     ta->allocated += usize;
> 
> the compiler generates SSE-based code (using $xmm0):
> 
>   0x00002aaaad381666 <+630>:   mov    %r12,0x28(%rsp)
>   0x00002aaaad38166b <+635>:   movq   0x28(%rsp),%xmm0
>   0x00002aaaad381671 <+641>:   movhps 0x20(%rsp),%xmm0
>   0x00002aaaad381676 <+646>:   paddq  (%rax),%xmm0
>   0x00002aaaad38167a <+650>:   movdqa %xmm0,(%rax)
>   0x00002aaaad38167e <+654>:   add    $0x38,%rsp 
> 
> Just a few instructions, which modify the value of _hBound_:
> 
> Old value = 57.6329994
> New value = 6.72623263e-44
> 0x00002aaaad381671 in realloc (ptr=<optimized out>, size=<optimized out>) at 
> src/jemalloc.c:1244
> 1244  src/jemalloc.c: No such file or directory.
> 1: x/i $pc
> => 0x2aaaad381671 <realloc+641>:      movhps 0x20(%rsp),%xmm0
> Continuing.
> Watchpoint 7: $xmm0.v4_float[0]
> 
> Old value = 6.72623263e-44
> New value = -2.22548424e-10
> 0x00002aaaad38167a in realloc (ptr=<optimized out>, size=<optimized out>) at 
> src/jemalloc.c:1244
> 1244  in src/jemalloc.c
> 1: x/i $pc
> => 0x2aaaad38167a <realloc+650>:      movdqa %xmm0,(%rax)
> Continuing.
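
For what it's worth, the movq/movhps/paddq/movdqa sequence quoted above looks
like the compiler fusing two adjacent 64-bit additions into one 128-bit packed
add (line 1244 together with the following update of ta->deallocated on line
1245). Below is a minimal C sketch of that pattern. The type counters_t and
the function record_realloc are made up for illustration and are not
jemalloc's definitions; whether a given compiler really emits a packed add
depends on its version and flags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-thread counters: two adjacent 64-bit
 * fields, 16-byte aligned so that one aligned 128-bit store covering both
 * would be legal. */
typedef struct {
    uint64_t allocated;
    uint64_t deallocated;
} __attribute__((aligned(16))) counters_t;

/* Same shape as the two updates at src/jemalloc.c:1244-1245. With -O3 or
 * -ftree-vectorize a compiler may combine both additions into a single
 * packed 64-bit add, which needs an SSE register as scratch space. */
void record_realloc(counters_t *ta, uint64_t usize, uint64_t old_size)
{
    assert(ta != NULL);
    ta->allocated += usize;
    ta->deallocated += old_size;
}
```

Compiling this with gcc -O3 -S and reading the generated assembly is one way
to check whether xmm registers show up for a particular compiler.
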
> 
> See "Calling conventions for different C++ compilers and operating 
> systems" (I assume it also applies to C, as the two are compatible).
> 
> 64-bit Linux. Callee-saved registers: RBX, RBP, R12-R15. All fine in 
> jemalloc's _realloc_:
> 
> Dump of assembler code for function realloc:
>   0x00002aaaad3803f0 <+0>:     push   %r15
>   0x00002aaaad3803f2 <+2>:     push   %r14
>   0x00002aaaad3803f4 <+4>:     push   %r13
>   0x00002aaaad3803f6 <+6>:     push   %r12
>   0x00002aaaad3803f8 <+8>:     push   %rbp
>   0x00002aaaad3803f9 <+9>:     mov    %rsi,%rbp
>   0x00002aaaad3803fc <+12>:    push   %rbx
> 
> But all other registers are scratch registers.
> 
> Also, from the "System V Application Binary Interface AMD64 Architecture 
> Processor Supplement" (October 7, 2013) [section 3.2.1]:
> 
> Registers %rbp, %rbx and %r12 through %r15 "belong" to the calling function 
> and the called function is required to preserve their values. In other words, 
> a called function must preserve these registers' values for its caller. 
> Remaining registers "belong" to the called function. If a calling function 
> wants to preserve such a register value across a function call, it must save 
> the value in its local stack frame.
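
A small sketch of what that rule means for an ordinary, compiler-visible
call on x86-64. Both function names are hypothetical, and the inline
assembly uses GCC/Clang syntax. The compiler knows xmm0 may be destroyed by
the callee, so it keeps its own copy of the float. A lazy PLT resolution is
different: the arguments are already sitting in their registers when the
resolver runs, and the compiler never sees that detour, so it cannot save
anything around it.

```c
#include <assert.h>

/* Hypothetical callee that uses xmm0 as scratch space, standing in for any
 * function that touches SSE registers. The clobber list tells the compiler
 * that xmm0 is destroyed. On targets other than x86-64 it does nothing. */
__attribute__((noinline)) void clobber_xmm0(void)
{
#if defined(__x86_64__)
    __asm__ volatile("pxor %%xmm0, %%xmm0" ::: "xmm0");
#endif
}

/* On x86-64, x arrives in xmm0. Because xmm0 is not callee-saved, the
 * compiler has to keep a copy of x somewhere else (typically in the local
 * stack frame) if it wants the value back after the call. */
float keep_across_call(float x)
{
    clobber_xmm0();
    return x;
}
```

Calling keep_across_call(57.6329994f) still returns the same value, because
here the caller had the chance to protect it.
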
> 
> Simply put, according to this, the /lib64/ld-linux-x86-64.so.2 dynamic 
> linker/loader (_dl_lookup_symbol_x) had to take action to protect the xmm0 
> register value before calling _realloc_.
> 
> You cannot compile jemalloc without SSE:
> 
> include/jemalloc/internal/prof.h:349:40: error: SSE register return with SSE 
> disabled
> 
> If we cannot stop jemalloc from using SSE registers, how can we work around 
> the problem?
> 
> 1240   if (config_stats && ret != NULL) {
> 1241     thread_allocated_t *ta;
> 1242     assert(usize == isalloc(ret, config_prof));
> 1243     ta = thread_allocated_tsd_get();
> 1244     ta->allocated += usize;
> 1245     ta->deallocated += old_size;
> 1246   }
> 
> In _realloc_, line 1244 is wrapped in an if on config_stats. Compiling 
> jemalloc with the --disable-stats option disables statistics collection, 
> which should also slightly increase performance.
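
A rough sketch of why a compile-time flag like this can make the counter
updates disappear from the generated code. The macro SKETCH_STATS, the
simplified struct, and the function account_realloc are all assumptions
made for illustration; this is not jemalloc's actual source or build
machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: SKETCH_STATS is a made-up macro standing in for whatever the
 * build defines when statistics are enabled. */
#ifdef SKETCH_STATS
static const bool config_stats = true;
#else
static const bool config_stats = false;
#endif

/* Simplified stand-in for the per-thread counters. */
typedef struct {
    uint64_t allocated;
    uint64_t deallocated;
} thread_allocated_t;

/* Same shape as the guarded block in _realloc_. When config_stats is a
 * constant false, the condition is known at compile time, so an optimizing
 * compiler can drop the whole body, including both counter updates. */
void account_realloc(thread_allocated_t *ta, const void *ret,
                     size_t usize, size_t old_size)
{
    if (config_stats && ret != NULL) {
        ta->allocated += usize;
        ta->deallocated += old_size;
    }
}
```

Built without SKETCH_STATS, account_realloc leaves the counters untouched;
built with -DSKETCH_STATS, it updates both.
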
> 
> It's a bit worrisome that arguments can change in between function calls.
> 
> david
