[ https://issues.apache.org/jira/browse/TS-3105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14195298#comment-14195298 ]
Susan Hinrichs edited comment on TS-3105 at 11/3/14 11:35 PM:
--------------------------------------------------------------
Last Friday, while working on the patches for 5.1, I ran into the following
issues.
VC_EVENT_EOS was being delivered to consumer_handler in some cases during a
POST workload. It looks like there were two causes for this.
1. The consumer's associated VC is for the HttpServerSession. The POST
response is very short (one packet), and it arrives before the second (server
response) tunnel is set up. Since no producer matches the VC, the event is
instead delivered to the consumer of the first tunnel. I fixed this by changing
the do_io_read in HttpSM::attach_server_session to read zero bytes. That is
enough to redirect error and timeout events to the new VC handler, but nothing
is read until the server response tunnel is in place and a second do_io_read
is issued in HttpSM::setup_server_read_response_header. With this change, the
events from the second tunnel are delivered to the second tunnel's producer.
I was able to reproduce this failure by doing POST-based file uploads against
test.websafedeposit.net. It didn't fail every time, but it failed frequently
enough to debug.
While poking around in this logic, I noticed that the call to do_io_read in
HttpSM::attach_client_session was passing a length of 0 but a non-null buffer.
I changed the third argument to NULL.
2. In the second case, a RESET is performed and is delivered as a VC_EVENT_EOS.
I triggered this by sending a RESET from the client side, so the EOS delivered
to consumer_handler should indeed be treated as an error case. To exercise it,
I wrote a test client that issues a RESET after sending part of the POST.
These fixes still need to be moved to the master patch.
> Combination of fixes for TS-3084 and TS-3073 causing asserts and segfaults on
> 5.1 and beyond
> --------------------------------------------------------------------------------------------
>
> Key: TS-3105
> URL: https://issues.apache.org/jira/browse/TS-3105
> Project: Traffic Server
> Issue Type: Bug
> Reporter: Susan Hinrichs
> Assignee: Susan Hinrichs
> Fix For: 5.2.0
>
> Attachments: ts-3073-and-3084-and-3105-against-510.patch,
> ts-3105-master-6.patch
>
>
> These two patches ran in a production environment on top of 5.0.1 without
> problems for several weeks. Running with these patches on top of 5.1 now
> causes either an assert or a segfault. Another person has reported the
> same segfault when running master in a production environment.
> In the assert, the handler_state of the producers is 0 (UNKNOWN) rather than
> the expected terminal state. I'm assuming that either we are being directed
> into the terminal state by a connection that terminates too quickly, or an
> event has hung around too long and is being executed against the state
> machine after it has been recycled.
> The event is HTTP_TUNNEL_EVENT_DONE.
> The assert stack trace is:
> FATAL: HttpSM.cc:2632: failed assert `0`
> /z/bin/traffic_server - STACK TRACE:
> /z/lib/libtsutil.so.5(+0x25197)[0x2b8bd08dc197]
> /z/lib/libtsutil.so.5(+0x23def)[0x2b8bd08dadef]
> /z/bin/traffic_server(HttpSM::tunnel_handler_post_or_put(HttpTunnelProducer*)+0xcd)[0x5982ad]
> /z/bin/traffic_server(HttpSM::tunnel_handler_post(int, void*)+0x86)[0x5a32d6]
> /z/bin/traffic_server(HttpSM::main_handler(int, void*)+0xd8)[0x5a1e18]
> /z/bin/traffic_server(HttpTunnel::main_handler(int, void*)+0xee)[0x5dd6ae]
> /z/bin/traffic_server(write_to_net_io(NetHandler*, UnixNetVConnection*, EThread*)+0x136e)[0x721d1e]
> /z/bin/traffic_server(NetHandler::mainNetEvent(int, Event*)+0x28c)[0x7162fc]
> /z/bin/traffic_server(EThread::process_event(Event*, int)+0x91)[0x744df1]
> /z/bin/traffic_server(EThread::execute()+0x4fc)[0x7458ac]
> /z/bin/traffic_server[0x7440ca]
> /lib64/libpthread.so.0(+0x7034)[0x2b8bd1ee4034]
> /lib64/libc.so.6(clone+0x6d)[0x2b8bd2c2875d]
> The segfault stack trace is:
> /z/bin/traffic_server - STACK TRACE:
> /lib64/libpthread.so.0(+0xf280)[0x2abccd0d8280]
> /z/bin/traffic_server(HttpSM::tunnel_handler_ua(int, HttpTunnelConsumer*)+0x122)[0x591462]
> /z/bin/traffic_server(HttpTunnel::consumer_handler(int, HttpTunnelConsumer*)+0x9e)[0x5dd15e]
> /z/bin/traffic_server(HttpTunnel::main_handler(int, void*)+0x117)[0x5dd6d7]
> /z/bin/traffic_server(UnixNetVConnection::mainEvent(int, Event*)+0x3f0)[0x725190]
> /z/bin/traffic_server(InactivityCop::check_inactivity(int, Event*)+0x275)[0x716b75]
> /z/bin/traffic_server(EThread::process_event(Event*, int)+0x91)[0x744df1]
> /z/bin/traffic_server(EThread::execute()+0x2fb)[0x7456ab]
> /z/bin/traffic_server[0x7440ca]
> /lib64/libpthread.so.0(+0x7034)[0x2abccd0d0034]
> /lib64/libc.so.6(clone+0x6d)[0x2abccde1475d]
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)