[ 
https://issues.apache.org/jira/browse/TS-3438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493170#comment-14493170
 ] 

Sudheer Vinukonda commented on TS-3438:
---------------------------------------

Finally managed to generate a core when the npn list is corrupted. After 
analyzing the core (basically, an assert on a hard-coded npn list's length for 
34 (since, the list used in our prod is fixed - 
{{"\bspdy/3.1\006spdy/3\bhttp/1.1\bhttp/1.0"}}) and it seems the npn end point 
list, npn set etc are all fine, but, the length is messed up (*68* instead of 
*34*)). 

After further thinking, it seems to me that, it's possible that two threads 
(basically, two ssl accepted connections) may try to create the npn list 
simultaneously (since, both see the list is null) and the length could be 
messed up due to interleaving. We need to either use a mutex to protect the 
critical section or create the npn list not on the first accept, but, before 
any accept is done. Will try to use a mutex and add logs to confirm that this 
is indeed the issue.

Below's some data from gdb:

{code}
(gdb) f 7
#7  0x0000000000729209 in SSLNextProtocolSet::advertiseProtocols 
(this=0x46961518, out=0x2b7722a14830, len=0x2b7722a1483c, hash=883090498) at 
SSLNextProtocolSet.cc:126
126     SSLNextProtocolSet.cc: No such file or directory.
(gdb) down
#6  0x0000000000728f90 in create_npn_advertisement (endpoints=..., 
npn=0x46961518, len=0x46961528) at SSLNextProtocolSet.cc:73
73      in SSLNextProtocolSet.cc
(gdb) p *len
$2 = 68
(gdb) p *npn
$3 = (unsigned char *) 0x2b77d0006fc0 "\bspdy/3.1\006spdy/3\bhttp/1.1\bhttp/1.0"
(gdb) p endpoints
$4 = (const SSLNextProtocolSet::NextProtocolEndpoint::list_type &) @0x46961538: 
{head = 0x46339820}
(gdb) p endpoints.head
$5 = (SSLNextProtocolSet::NextProtocolEndpoint *) 0x46339820
(gdb) p *endpoints.head
$6 = {protocol = 0x7d80d0 "spdy/3.1", endpoint = 0x46961550, link = 
{<SLink<SSLNextProtocolSet::NextProtocolEndpoint>> = {next = 0x466c8d10}, prev 
= 0x0}}
(gdb) p *endpoints.head.link.next
$7 = {protocol = 0x7d80c0 "spdy/3", endpoint = 0x2490400, link = 
{<SLink<SSLNextProtocolSet::NextProtocolEndpoint>> = {next = 0x46282ff0}, prev 
= 0x46339820}}
(gdb) p *endpoints.head.link.next.link.next
$8 = {protocol = 0x7d8088 "http/1.1", endpoint = 0x46961450, link = 
{<SLink<SSLNextProtocolSet::NextProtocolEndpoint>> = {next = 0x45fc2140}, prev 
= 0x466c8d10}}
(gdb) p *endpoints.head.link.next.link.next.link.next
$9 = {protocol = 0x7d8070 "http/1.0", endpoint = 0x46961450, link = 
{<SLink<SSLNextProtocolSet::NextProtocolEndpoint>> = {next = 0x0}, prev = 
0x46282ff0}}
(gdb) p *endpoints.head.link.next.link.next.link.next.link.next
Cannot access memory at address 0x0
{code}


Below's the traceback:
{code}
Program terminated with signal SIGABRT, Aborted.
#0  0x0000003b98832625 in raise () from /lib64/libc.so.6
(gdb) bt
#0  0x0000003b98832625 in raise () from /lib64/libc.so.6
#1  0x0000003b98833e05 in abort () from /lib64/libc.so.6
#2  0x00002b762b915018 in ink_die_die_die (retval=1) at ink_error.cc:43
#3  0x00002b762b9150e5 in ink_fatal_va(int, const char *, typedef __va_list_tag 
__va_list_tag *) (return_code=1, message_format=0x2b762b923d78 "%s:%d: failed 
assert `%s`", ap=0x2b7722a14630)
    at ink_error.cc:65
#4  0x00002b762b9151ae in ink_fatal (return_code=1, 
message_format=0x2b762b923d78 "%s:%d: failed assert `%s`") at ink_error.cc:73
#5  0x00002b762b913d20 in _ink_assert (expression=0x7d0fed "*len <= 34", 
file=0x7d0fa0 "SSLNextProtocolSet.cc", line=73) at ink_assert.cc:37
#6  0x0000000000728f90 in create_npn_advertisement (endpoints=..., 
npn=0x46961518, len=0x46961528) at SSLNextProtocolSet.cc:73
#7  0x0000000000729209 in SSLNextProtocolSet::advertiseProtocols 
(this=0x46961518, out=0x2b7722a14830, len=0x2b7722a1483c, hash=883090498) at 
SSLNextProtocolSet.cc:126
#8  0x00000000007275c2 in SSLNetVConnection::advertise_next_protocol 
(ssl=0x2b77d0000900, out=0x2b7722a14830, outlen=0x2b7722a1483c) at 
SSLNetVConnection.cc:807
#9  0x00000038cb833cf8 in ssl_add_serverhello_tlsext () from 
/usr/lib64/libssl.so.10
#10 0x00000038cb81f786 in ssl3_send_server_hello () from /usr/lib64/libssl.so.10
#11 0x00000038cb821d78 in ssl3_accept () from /usr/lib64/libssl.so.10
#12 0x00000038cb82f846 in ssl23_get_client_hello () from /usr/lib64/libssl.so.10
#13 0x00000038cb830143 in ssl23_accept () from /usr/lib64/libssl.so.10
#14 0x0000000000726a0b in SSLNetVConnection::sslServerHandShakeEvent 
(this=0x2b77800182c0, err=@0x2b7722a14b6c: 11127) at SSLNetVConnection.cc:607
#15 0x0000000000726937 in SSLNetVConnection::sslStartHandShake 
(this=0x2b77800182c0, event=0, err=@0x2b7722a14b6c: 11127) at 
SSLNetVConnection.cc:583
#16 0x0000000000725577 in SSLNetVConnection::net_read_io (this=0x2b77800182c0, 
nh=0x2b7721100cb0, lthread=0x2b77210fd010) at SSLNetVConnection.cc:256
#17 0x000000000073312e in NetHandler::mainNetEvent (this=0x2b7721100cb0, 
event=5, e=0x1d609d0) at UnixNet.cc:515
#18 0x00000000004f7098 in Continuation::handleEvent (this=0x2b7721100cb0, 
event=5, data=0x1d609d0) at ../iocore/eventsystem/I_Continuation.h:146
#19 0x000000000075a1fe in EThread::process_event (this=0x2b77210fd010, 
e=0x1d609d0, calling_code=5) at UnixEThread.cc:145
#20 0x000000000075a708 in EThread::execute (this=0x2b77210fd010) at 
UnixEThread.cc:269
#21 0x000000000075977c in spawn_thread_internal (a=0x2168830) at Thread.cc:88
#22 0x00002b762be949d1 in start_thread () from /lib64/libpthread.so.0
#23 0x0000003b988e88fd in clone () from /lib64/libc.so.6
{code}

> npn list being created with garbage data
> ----------------------------------------
>
>                 Key: TS-3438
>                 URL: https://issues.apache.org/jira/browse/TS-3438
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: SSL
>    Affects Versions: 5.3.0
>            Reporter: Sudheer Vinukonda
>            Assignee: Sudheer Vinukonda
>             Fix For: 6.0.0
>
>
> We noticed that occasionally upon a restart, the npn list built by ATS has 
> garbage data (additional null bytes), resulting in the client failing to 
> decode the ServerHello correctly. The problem would always go away with a 
> restart and is hard to reproduce.
> Here's a trace that shows the good vs bad npn lists in the ServerHello:
> bad npn list (with spdy and http enabled):
> {code}
>             Extension: next_protocol_negotiation
>                 Type: next_protocol_negotiation (0x3374)
>                 Length: 36
>                 Next Protocol Negotiation
>                     Protocol string length: 8
>                     Next Protocol: spdy/3.1
>                     Protocol string length: 6
>                     Next Protocol: spdy/3
>                     Protocol string length: 8
>                     Next Protocol: http/1.1
>                     Protocol string length: 8
>                     Next Protocol: http/1.0
>                     Protocol string length: 0
>                     Protocol string length: 0
> {code}
> good npn list:
> {code}
>             Extension: next_protocol_negotiation
>                 Type: next_protocol_negotiation (0x3374)
>                 Length: 34
>                 Next Protocol Negotiation
>                     Protocol string length: 8
>                     Next Protocol: spdy/3.1
>                     Protocol string length: 6
>                     Next Protocol: spdy/3
>                     Protocol string length: 8
>                     Next Protocol: http/1.1
>                     Protocol string length: 8
>                     Next Protocol: http/1.0
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to