[ 
https://issues.apache.org/jira/browse/TS-1487?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13714138#comment-13714138
 ] 

David Carlin commented on TS-1487:
----------------------------------

While troubleshooting TS-2051, Traffic Server crashed after about 22 minutes 
under very light load, listening only on port 443 for SSL traffic. The crash 
produced two core dumps: one from an SSL thread that looked like the others in 
TS-2051, and a new one from a NET thread that I hadn't seen before. Alan said 
on IRC that it is related to TS-1487. The new backtrace:

{quote}
Core was generated by `/home/y/bin/traffic_server -M --httpport 443:fd=9:ssl'.
Program terminated with signal 11, Segmentation fault.
#0  APIHooks::get (this=0x28) at InkAPI.cc:1246
1246    InkAPI.cc: No such file or directory.
        in InkAPI.cc
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.107.el6.x86_64 
hwloc-1.5-1.el6.x86_64 keyutils-libs-1.4-4.el6.x86_64 
krb5-libs-1.10.3-10.el6_4.2.x86_64 libattr-2.4.44-7.el6.x86_64 
libcap-2.16-5.5.el6.x86_64 libcom_err-1.41.12-14.el6.x86_64 
libgcc-4.4.7-3.el6.x86_64 libselinux-2.0.94-5.3.el6_4.1.x86_64 
libstdc++-4.4.7-3.el6.x86_64 libxml2-2.7.6-12.el6_4.1.x86_64 
nss-softokn-freebl-3.12.9-11.el6.x86_64 numactl-2.0.7-6.el6.x86_64 
openssl-1.0.0-27.el6_4.2.x86_64 pciutils-libs-3.1.10-2.el6.x86_64 
pcre-7.8-6.el6.x86_64 tcl-8.5.7-6.el6.x86_64 
xz-libs-4.999.9-0.3.beta.20091007git.el6.x86_64 zlib-1.2.3-29.el6.x86_64
(gdb) bt
#0  APIHooks::get (this=0x28) at InkAPI.cc:1246
#1  0x00000000004c1485 in get () at InkAPIInternal.h:214
#2  CB_After_Cache_Init () at Main.cc:456
#3  0x000000000062ada8 in Cache::open_done (this=0x2b760800f560) at Cache.cc:1987
#4  0x000000000062b385 in vol_initialized (this=0x2b7608058010) at Cache.cc:1858
#5  Vol::dir_init_done (this=0x2b7608058010) at Cache.cc:1729
#6  0x00000000005eb1a5 in handleEvent (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at ../../iocore/eventsystem/I_Continuation.h:146
#7  AIOCallbackInternal::io_complete (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>) at ../../iocore/aio/P_AIO.h:123
#8  0x00000000006a1aff in handleEvent (this=0x2b75ea1ec010, e=0x1e215f0, calling_code=1) at I_Continuation.h:146
#9  EThread::process_event (this=0x2b75ea1ec010, e=0x1e215f0, calling_code=1) at UnixEThread.cc:141
#10 0x00000000006a267b in EThread::execute (this=0x2b75ea1ec010) at UnixEThread.cc:192
#11 0x00000000006a099a in spawn_thread_internal (a=0x1cf4ae0) at Thread.cc:88
#12 0x00002b75e7976851 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003f820e890d in clone () from /lib64/libc.so.6
{quote}
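
A note on reading frame #0: `this=0x28` is far too small to be a real object address. A value like that commonly appears when a member function is called on a sub-object reached through a null pointer, so `this` ends up equal to the member's offset. The sketch below only illustrates that arithmetic. `Hooks`, `HookTable`, and `this_for_null_owner` are hypothetical names, and the layout is invented to produce the same offset; it is not the real Traffic Server layout.

```cpp
#include <cstddef>

// Hypothetical stand-ins, NOT the real Traffic Server classes or layout.
struct Hooks {
    void *head;
};

struct HookTable {
    char preceding_members[0x28]; // assumed: 0x28 bytes of members come first
    Hooks hooks;                  // so this member sits at offset 0x28
};

// The value `this` would have inside a Hooks member function if it were
// reached through a null HookTable pointer. Computed with offsetof, so
// nothing is actually dereferenced.
std::size_t this_for_null_owner() {
    return offsetof(HookTable, hooks);
}
```

If that reading applies here, the likely culprit would be a still-null owning pointer at the time `CB_After_Cache_Init` runs, which would fit an initialization-order problem, but the backtrace alone does not prove it.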
                
> the ordering of plugin_init and init_HttpProxyServer cause crashed TS to core 
> endlessly
> ---------------------------------------------------------------------------------------
>
>                 Key: TS-1487
>                 URL: https://issues.apache.org/jira/browse/TS-1487
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 3.2.0
>         Environment: Linux RHEL6.2
>            Reporter: Aidan McGurn
>            Assignee: Alan M. Carroll
>            Priority: Critical
>              Labels: A
>             Fix For: 3.3.5
>
>         Attachments: INTD-529-RespawnCrash.patch, 
> INTD-529-RespawnCrash.patch, ts-1487.diff
>
>
> We've had a serious issue whereby TS, when it crashes, re-spawns and cores 
> continuously when it tries to restart under load. I traced the issue to the 
> SNMP research library (a third-party lib). It uses select(), and what 
> happens is that the file descriptor numbers spike under load after the crash, 
> as all the sockets get opened at once. This causes a buffer overflow in the 
> select() calls (which their library is full of), as the fd passed to FD_SET 
> is much bigger than the FD_SETSIZE of 1024 (which was a bitch to track down, 
> as the stack was corrupted and gdb therefore useless). Tracing why this 
> happened on 3.2.0 and not 3.0.2, I found that the plugin_init sequence has 
> changed. On 3.0.2 the sequence was in effect 1. plugin_init and then 
> 2. init_HttpProxyServer, whereas this has mysteriously been reversed on 
> 3.2.0. To get our system to work in this crash case, I've patched ATS to 
> flip them around as in 3.0.2.
> I'll attach the patch we propose to use to get around this.
> Is this actually a bug waiting to happen in other systems, or was there 
> a reason to change this sequence?
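
The overflow described above can be illustrated with a small guard. An `fd_set` is a fixed-size bitmap with room for `FD_SETSIZE` descriptors, so `FD_SET` on a larger descriptor writes outside it. The helper below is a minimal sketch of a defensive check; `safe_fd_set` is a hypothetical name, not something from Traffic Server or the SNMP library.

```cpp
#include <sys/select.h>

// Hypothetical helper: add fd to the set only if it fits in the bitmap.
// FD_SET on a descriptor >= FD_SETSIZE (or a negative one) is undefined
// behavior and can corrupt adjacent memory, such as the stack.
bool safe_fd_set(int fd, fd_set *set) {
    if (fd < 0 || fd >= FD_SETSIZE) {
        return false; // caller must handle this descriptor another way
    }
    FD_SET(fd, set);
    return true;
}
```

A check like this only turns silent corruption into a visible failure. Code that must handle descriptors above `FD_SETSIZE` would need an interface without that limit, such as `poll()`.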

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
