ema opened a new issue #7423:
URL: https://github.com/apache/trafficserver/issues/7423


   While investigating suspiciously high system CPU usage on a few servers 
running ATS, we observed that the traffic_server process was causing more than 
50 thousand **mmap** calls per second, all of which with length 65536. 
   
   ```
   # pid 31316 is traffic_server
   $ sudo bpftrace -e 'kprobe:do_mmap /pid == 31316/ { @[arg2]=count(); } 
interval:s:1 { exit(); }'
   Attaching 2 probes...
   @[65536]: 64988
   ```
   By looking at LuaJIT's source code we found that the function 
**mcode_alloc_at** is responsible for calling mmap, so we  obtained a few 
backtraces for that function with the following SystemTap script:
   
   ```
   probe 
process("/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0").function("mcode_alloc_at")
 {
     print_ubacktrace()
     exit()
   }
   ```
   The script must be executed as follows to properly include all symbols:
   
   ```
   $ sudo stap -v -d /usr/bin/traffic_server -d 
/usr/lib/trafficserver/modules/tslua.so mcode_alloc_at.stp
   ```
   Most backtraces looked either like this:
   ```
    0x7fb44df01338 : mcode_allocarea+0xb8/0x160 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44df19d54 : lj_asm_trace+0x25b4/0x3460 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44df381d1 : trace_state+0x201/0x1340 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44def2146 : lj_vm_cpcall+0x4a/0x5e 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44def864b : lj_trace_ins+0x5b/0x70 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44df410dc : lj_dispatch_call+0x21c/0x250 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44def37ba : lj_vm_hotcall+0x1e/0x43 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44df3f22c : lua_pcall+0x5c/0x80 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb52800d9a3 : globalHookHandler+0x313/0x450 
[/usr/lib/trafficserver/modules/tslua.so]
   ```
   Or like this:
   
   ```
    0x7fb44df01338 : mcode_allocarea+0xb8/0x160 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44df19d54 : lj_asm_trace+0x25b4/0x3460 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44df381d1 : trace_state+0x201/0x1340 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44def2146 : lj_vm_cpcall+0x4a/0x5e 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44def864b : lj_trace_ins+0x5b/0x70 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44df34e17 : lj_dispatch_ins+0x177/0x1e0 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44def3741 : lj_vm_inshook+0x2d/0x44 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44df48ec1 : lj_cf_string_gsub+0x2e1/0x440 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb44def1dd7 : lj_BC_FUNCC+0x34/0x59 
[/usr/lib/x86_64-linux-gnu/libluajit-5.1.so.2.1.0]
    0x7fb52801a0fe : ts_lua_http_cont_handler+0x15e/0x410 
[/usr/lib/trafficserver/modules/tslua.so]
   ```
   
   I think the former traces are from tslua used as a global plugin, the latter 
for remap rules.
   
   Almost all our servers were affected by this, with the exception of a 
handful were traffic_server almost never called **mmap**. We do not have any 
valid theory explaining the difference, given that the servers run the same 
software stack and configuration. 
   However, the discrepancy allowed us to observe that on systems **not** 
calling mmap, the various Lua JIT trace calls (eg: **lj_trace_ins**) were also 
not called. We hence tried to disable the JIT compiler and only rely on the 
interpreter by adding **jit.off(true, true)** to all our Lua scripts. The 
result was that indeed mmap stopped being called, system CPU usage went down 
significantly, and the performance of ATS increased visibly. 
   
   
![ats-sys-cpu](https://user-images.githubusercontent.com/281102/104574876-d4a9b500-5656-11eb-9980-4f93200de2e2.png)
   
   We think that what might be happening is that the Lua JIT compiler keeps on 
identifying hot code and compiling it, instead of retaining and reusing the 
built code, or something along those lines. It would be great to:
   
   - better understand why the compiler is kicking in so often
   - have a configuration setting to turn off the compiler, similarly to what 
luajit's **-joff** does: https://luajit.org/running.html
   
   In **plugin.config** we have:
   
   ```
   tslua.so /etc/trafficserver/lua/default.lua --enable-reload
   ```
   
   **remap.config**: https://phabricator.wikimedia.org/P13769
   
   All lua scripts 
[here](https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/ab6958815d4d346ecfcca28724cd6d91a33d9eaf/modules/profile/files/trafficserver/).
   
   Our bug tracking system has [further details on the 
investigation](https://phabricator.wikimedia.org/T265625).
   
   
   
    


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to