c-taylor opened a new issue #7378:
URL: https://github.com/apache/trafficserver/issues/7378
I wanted to share some really unusual system calls on ET_AIO threads. I
think it's odd enough to warrant further investigation as the cause/impact is
uncertain.
When looking at system calls on a running system, even with light or no
activity, the following can be observed.
```
grep "ETIMEDOUT (Connection timed out)" strace.out |head
34908 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010099>
34909 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010179>
34910 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010072>
34911 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010143>
34912 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010094>
```
The interesting thing about all of the PIDs is that they are ET_AIO threads,
which in theory manage no connections.
## Reproduction
Tested version: 8.1.x
Start traffic_server by hand:
`strace -fT -o strace.out traffic_manager &`
No 'load' is required, but I all sent through about 4-5 requests that would
typically cache.
strace capture and traffic_(server|manager) stopped after 60-120s.
### Find ET_AIO threads
e.g. `grep "PR_SET_NAME, \"\[ET_AIO" strace.out |awk '{print $1}' |sort -n
|uniq -c`
```
34803 clone(child_stack=0x7fe9109fdf30,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0x7fe9109fe9d0, tls=0x7fe9109fe700, child_tidptr=0x7fe9109fe9d0)
= 34908 <0.000046>
34803 mmap(NULL, 2097152, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0 <unfinished ...>
34908 set_robust_list(0x7fe9109fe9e0, 24) = 0 <0.000014>
34803 <... mmap resumed>) = 0x7fe90e000000 <0.000038>
34908 prctl(PR_SET_NAME, "[ET_AIO 1:131]") = 0 <0.000011>
34908 mmap(NULL, 2097152, PROT_READ|PROT_WRITE,
MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fe90de00000 <0.000057>
34908 set_mempolicy(MPOL_INTERLEAVE, [0x0000000000000001], 65) = 0 <0.000012>
34908 futex(0x7fe90ea631ac, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME,
2, {tv_sec=1605182243, tv_nsec=587387395}, 0xffffffff <unfinished ...>
```
### Find "Connection Timed Out"
```
$ grep "ETIMEDOUT (Connection timed out)" strace.out |head
34908 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010099>
34909 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010179>
34910 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010072>
34911 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010143>
34912 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010094>
34913 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010080>
34907 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010025>
34914 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010057>
34916 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010067>
34917 <... futex resumed>) = -1 ETIMEDOUT (Connection timed
out) <0.010072>
```
## Observations
In my own tests there is near perfect alignment on PIDs regularly reporting
"Connection timed out" and ET_AIO threads.
They also don't appear to subside, but continue even on long running, near
idle systems.
* What is the origin of these messages?
* Why is ET_AIO 'timing out' any connections?
* Did we clone more than intended?
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]