[
https://issues.apache.org/jira/browse/TS-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714138#comment-13714138
]
David Carlin commented on TS-1487:
----------------------------------
While troubleshooting TS-2051, Traffic Server crashed after about 22 minutes
under very light load, listening only on port 443 for SSL traffic. The crash
produced two core dumps: one from an SSL thread that looked like the others in
TS-2051, and a new one from a NET thread that I hadn't seen before. Alan said
on IRC that the new one is related to TS-1487. Backtrace from the new core:
{quote}
Core was generated by `/home/y/bin/traffic_server -M --httpport 443:fd=9:ssl'.
Program terminated with signal 11, Segmentation fault.
#0 APIHooks::get (this=0x28) at InkAPI.cc:1246
1246 InkAPI.cc: No such file or directory.
in InkAPI.cc
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64
hwloc-1.5-1.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64
krb5-libs-1.10.3-10.el6_4.2.x86_64 libattr-2.4.44-7.el6.x86_64
libcap-2.16-5.5.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64
libgcc-4.4.7-3.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64
libstdc++-4.4.7-3.el6.x86_64 libxml2-2.7.6-12.el6_4.1.x86_64
nss-softokn-freebl-3.12.9-11.el6.x86_64 numactl-2.0.7-6.el6.x86_64
openssl-1.0.0-27.el6_4.2.x86_64 pciutils-libs-3.1.10-2.el6.x86_64
pcre-7.8-6.el6.x86_64 tcl-8.5.7-6.el6.x86_64
xz-libs-4.999.9-0.3.beta.20091007git.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0 APIHooks::get (this=0x28) at InkAPI.cc:1246
#1 0x00000000004c1485 in get () at InkAPIInternal.h:214
#2 CB_After_Cache_Init () at Main.cc:456
#3 0x000000000062ada8 in Cache::open_done (this=0x2b760800f560) at Cache.cc:1987
#4 0x000000000062b385 in vol_initialized (this=0x2b7608058010) at Cache.cc:1858
#5 Vol::dir_init_done (this=0x2b7608058010) at Cache.cc:1729
#6 0x00000000005eb1a5 in handleEvent (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at ../../iocore/eventsystem/I_Continuation.h:146
#7 AIOCallbackInternal::io_complete (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at ../../iocore/aio/P_AIO.h:123
#8 0x00000000006a1aff in handleEvent (this=0x2b75ea1ec010, e=0x1e215f0, calling_code=1) at I_Continuation.h:146
#9 EThread::process_event (this=0x2b75ea1ec010, e=0x1e215f0, calling_code=1) at UnixEThread.cc:141
#10 0x00000000006a267b in EThread::execute (this=0x2b75ea1ec010) at UnixEThread.cc:192
#11 0x00000000006a099a in spawn_thread_internal (a=0x1cf4ae0) at Thread.cc:88
#12 0x00002b75e7976851 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003f820e890d in clone () from /lib64/libc.so.6
{quote}
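One plausible reading of `this=0x28` in frame #0: the object that owns the hook
lists was still a null pointer when CB_After_Cache_Init ran, so the inlined
get() in frame #1 handed APIHooks::get the address "null plus a small offset".
The sketch below only illustrates that arithmetic. The types, the slot count
and the slot index are hypothetical and chosen to reproduce 0x28; they are not
the real Traffic Server layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins, NOT the real Traffic Server classes.
struct Hooks {
    std::uint64_t head;  // one 8-byte field, so sizeof(Hooks) == 8
};

constexpr std::size_t kSlots = 8;  // assumed slot count, for illustration only

struct HookTable {
    Hooks slots[kSlots];
};

// Computes the `this` value a member function of slots[id] would receive
// if the table itself lived at `table_address`. Nothing is dereferenced,
// so passing 0 (a null table) is safe here.
std::uintptr_t this_seen_by_callee(std::uintptr_t table_address, std::size_t id) {
    assert(id < kSlots);
    return table_address + offsetof(HookTable, slots) + id * sizeof(Hooks);
}
```

With a null table (address 0) and the assumed slot index 5,
this_seen_by_callee(0, 5) is 0x28. Reading through that address is what
raises the SIGSEGV.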
> the ordering of plugin_init and init_HttpProxyServer cause crashed TS to core
> endlessly
> ---------------------------------------------------------------------------------------
>
> Key: TS-1487
> URL: https://issues.apache.org/jira/browse/TS-1487
> Project: Traffic Server
> Issue Type: Bug
> Components: Core
> Affects Versions: 3.2.0
> Environment: Linux RHEL6.2
> Reporter: Aidan McGurn
> Assignee: Alan M. Carroll
> Priority: Critical
> Labels: A
> Fix For: 3.3.5
>
> Attachments: INTD-529-RespawnCrash.patch,
> INTD-529-RespawnCrash.patch, ts-1487.diff
>
>
> We've had a serious issue whereby TS, when it crashes, re-spawns and cores
> continuously when it tries to restart under load. I traced the issue to the
> SNMP research library (a third-party lib). It uses select(), and what happens
> is that file descriptor numbers spike under load after the crash, as all the
> sockets get opened at once. This causes a buffer overflow in the select calls
> (which their library is full of), because the fd passed to FD_SET is much
> larger than the FD_SETSIZE of 1024 (which was a bitch to track down, as the
> stack was corrupted and gdb therefore useless). Tracing why this happened on
> 3.2.0 and not 3.0.2, I found that the order of plugin_init has changed. On
> 3.0.2 the sequence was in effect 1. plugin_init and then
> 2. init_HttpProxyServer, whereas this has mysteriously been reversed on 3.2.0.
> To get our system to work in this crash case, I've patched ATS to flip them
> around as in 3.0.2.
> I'll attach the patch we propose to use to get around this.
> Is this a bug waiting to happen in other systems, or was there a reason to
> change this sequence?
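
For reference, the overflow described above comes from calling FD_SET with a
descriptor at or above FD_SETSIZE, which writes past the end of the fd_set. A
minimal sketch of the kind of guard that avoids it is below. The helper names
are hypothetical; this is not code from the SNMP library or from ATS.

```cpp
#include <cassert>
#include <sys/select.h>

// True only when fd can be stored in an fd_set without writing out of bounds.
bool fits_in_fd_set(int fd) {
    return fd >= 0 && fd < FD_SETSIZE;
}

// Adds fd to *set when it is in range. Returns false and leaves *set
// untouched when fd is negative or >= FD_SETSIZE.
bool safe_fd_set(int fd, fd_set *set) {
    assert(set != nullptr);
    if (!fits_in_fd_set(fd)) {
        return false;
    }
    FD_SET(fd, set);
    return true;
}
```

A caller that gets false back has to fall back to something other than
select() for that descriptor, for example poll(), which has no fixed upper
bound on descriptor values.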