[ https://issues.apache.org/jira/browse/TS-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289206#comment-14289206 ]

Feifei Cai commented on TS-2497:
--------------------------------

A memory leak was noticed on our production hosts. It appears to be related to
handling 5xx responses from the origin server.

The dump info below is from one host serving ~70% POST requests. I enabled the
memory dump ({{proxy.config.dump_mem_info_frequency}}) and resource tracking
({{proxy.config.res_track_memory}}).
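
For reference, this is roughly how the two settings can be enabled in records.config
(the 60-second dump interval is just my example value):
{noformat}
CONFIG proxy.config.dump_mem_info_frequency INT 60
CONFIG proxy.config.res_track_memory INT 1
{noformat}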

*traffic.out:*
{noformat}
     allocated      |        in-use      | type size  |   free list name
--------------------|--------------------|------------|----------------------------------
                  0 |                  0 |    2097152 | memory/ioBufAllocator[14]
                  0 |                  0 |    1048576 | memory/ioBufAllocator[13]
                  0 |                  0 |     524288 | memory/ioBufAllocator[12]
                  0 |                  0 |     262144 | memory/ioBufAllocator[11]
                  0 |                  0 |     131072 | memory/ioBufAllocator[10]
                  0 |                  0 |      65536 | memory/ioBufAllocator[9]
         1266679808 |         1262354432 |      32768 | memory/ioBufAllocator[8]
          600309760 |          599703552 |      16384 | memory/ioBufAllocator[7]
          395051008 |          391086080 |       8192 | memory/ioBufAllocator[6]
          229113856 |          224432128 |       4096 | memory/ioBufAllocator[5]
          342622208 |          342503424 |       2048 | memory/ioBufAllocator[4]
          245104640 |          245042176 |       1024 | memory/ioBufAllocator[3]
            2228224 |            2176512 |        512 | memory/ioBufAllocator[2]
             622592 |             607232 |        256 | memory/ioBufAllocator[1]
            2375680 |            2370176 |        128 | memory/ioBufAllocator[0]
                                          Location |          Size In-use
---------------------------------------------------+------------------------
  memory/IOBuffer/ProtocolProbeSessionAccept.cc:39 |             66768896
          memory/IOBuffer/HttpClientSession.cc:230 |                    0
                    memory/IOBuffer/HttpSM.cc:3314 |                    0
                    memory/IOBuffer/HttpSM.cc:5349 |           3003506816
                    memory/IOBuffer/HttpSM.cc:5668 |                    0
                    memory/IOBuffer/HttpSM.cc:5874 |                    0
                    memory/IOBuffer/HttpSM.cc:5976 |                    0
                    memory/IOBuffer/HttpSM.cc:6267 |                    0
           memory/IOBuffer/HttpServerSession.cc:87 |                    0
                  memory/IOBuffer/HttpTunnel.cc:95 |                    0
                 memory/IOBuffer/HttpTunnel.cc:100 |                    0
                                             TOTAL |           3070275712
{noformat}


I took [~shaunmcginnity]'s node.js scripts as a reference, made some changes, and
reproduced the memory leak in my local environment.

# origin-server.js
This origin server responds with a 503 once it receives more than a single byte, so
in most cases the POST does not complete. I changed [~shaunmcginnity]'s code so that
the origin server responds to ATS instead of disconnecting, which makes ATS hit a
different code path.
# client.js
We create a new client every second, and each client tries to POST 32K bytes of data
(a minimal sketch of both scripts is included after this list).
# ATS
*remap.config*: remap all to local port 5000
{quote}map / http://127.0.0.1:5000{quote}
*records.config*: listen on 80
{quote}CONFIG proxy.config.http.server_ports STRING 80{quote}
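
A minimal sketch of the two node.js scripts, approximating the behavior described
above (the actual attached scripts may differ; port 5000, the 32K POST size and the
one-client-per-second rate come from the description, everything else is my
assumption):
{code}
// origin-server.js (sketch): reply 503 as soon as more than one byte of the
// request body has arrived, without reading the rest of the POST.
var http = require('http');

http.createServer(function (req, res) {
  var received = 0;
  var responded = false;
  req.on('data', function (chunk) {
    received += chunk.length;
    if (received > 1 && !responded) {
      responded = true;
      req.pause();                       // stop consuming the remaining post data
      res.writeHead(503, { 'Content-Type': 'text/plain' });
      res.end('Service Unavailable\n');
    }
  });
}).listen(5000, '127.0.0.1');
{code}
{code}
// client.js (sketch): one new client per second, each POSTing 32K through ATS.
var http = require('http');

var body = new Buffer(32768);
body.fill('a');

setInterval(function () {
  var req = http.request({
    host: '127.0.0.1',
    port: 80,                            // ATS listens on 80 (records.config above)
    method: 'POST',
    path: '/',
    headers: { 'Content-Length': body.length }  // lower this to hit a smaller allocator index
  });
  req.on('error', function () { /* expected: ATS answers with the 503 or resets */ });
  req.on('response', function (res) { res.resume(); });
  req.write(body);
  req.end();
}, 1000);
{code}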

Then we get dump info as follows, and the in-use count of MIOBuffer with
index=8 (size=32K) increases by one per second.

{noformat}
     allocated      |        in-use      | type size  |   free list name
--------------------|--------------------|------------|----------------------------------
            1048576 |              32768 |      32768 | memory/ioBufAllocator[8]
{noformat}

We can also change the "Content-Length" in client.js to a smaller size, and the
MIOBuffer with the corresponding index (0-7) also increases; allocator index i
corresponds to a buffer size of 128 << i (e.g. index 8 = 32K).


I added the simple patch below to prevent the memory leak in the case above, similar
to the last commit, and verified it on one test host.

free.diff
{code}
diff --git a/proxy/http/HttpSM.cc b/proxy/http/HttpSM.cc
index 932ef97..123b97a 100644
--- a/proxy/http/HttpSM.cc
+++ b/proxy/http/HttpSM.cc
@@ -5074,6 +5074,7 @@ HttpSM::handle_post_failure()
   t_state.current.server->keep_alive = HTTP_NO_KEEPALIVE;

   if (server_buffer_reader->read_avail() > 0) {
+    tunnel.deallocate_buffers();
     tunnel.reset();
     // There's data from the server so try to read the header
     setup_server_read_response_header();
{code}

*traffic.out*
{noformat}
     allocated      |        in-use      | type size  |   free list name
--------------------|--------------------|------------|----------------------------------
                  0 |                  0 |    2097152 | memory/ioBufAllocator[14]
                  0 |                  0 |    1048576 | memory/ioBufAllocator[13]
                  0 |                  0 |     524288 | memory/ioBufAllocator[12]
                  0 |                  0 |     262144 | memory/ioBufAllocator[11]
                  0 |                  0 |     131072 | memory/ioBufAllocator[10]
                  0 |                  0 |      65536 | memory/ioBufAllocator[9]
            6291456 |            1474560 |      32768 | memory/ioBufAllocator[8]
             524288 |              32768 |      16384 | memory/ioBufAllocator[7]
            4456448 |            1949696 |       8192 | memory/ioBufAllocator[6]
            6291456 |            2674688 |       4096 | memory/ioBufAllocator[5]
             262144 |               4096 |       2048 | memory/ioBufAllocator[4]
             131072 |               8192 |       1024 | memory/ioBufAllocator[3]
              65536 |                  0 |        512 | memory/ioBufAllocator[2]
              32768 |                512 |        256 | memory/ioBufAllocator[1]
              16384 |               1280 |        128 | memory/ioBufAllocator[0]
{noformat}

However, in my understanding this patch effectively reverts TS-2497. So it is not
the right fix; it just brings us back to the starting point: the potential crash
issue. :(

I'm not yet clear on the root cause of the "prematurely freed buffer" issue.
I think we cannot rely on the {{HttpTunnel::deallocate_buffers()}} call in
{{HttpSM::kill_this()}} to free the {{post_buffer}} we allocated in
{{HttpSM::do_setup_post_tunnel}}. In the test case above, the server does not
receive all of the post data, but responds to ATS (mostly with a 5xx) instead of
disconnecting, so we hit the {{if}} condition in
{{HttpSM::handle_post_failure()}}. ATS then transfers the 5xx response back to the
client: it resets the tunnel and reuses it, but it does not reuse the buffer; it
allocates a new one when needed. So when HttpSM finally calls {{kill_this()}} to
clean up the tunnel, it frees the buffer allocated in
{{HttpSM::setup_server_transfer()}}, not the {{post_buffer}} from
{{HttpSM::do_setup_post_tunnel}}.
We can enable {{proxy.config.res_track_memory}} to track this, or debug and break
on each call of {{HttpTunnel::deallocate_buffers()}} to inspect the MIOBuffer's
location.
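
For example, a minimal gdb sketch of the second approach (assuming a debug build of
traffic_server; only the breakpoint location comes from the discussion above):
{noformat}
$ gdb -p $(pidof traffic_server)
(gdb) break HttpTunnel::deallocate_buffers
(gdb) commands
> backtrace 5
> continue
> end
(gdb) continue
{noformat}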

BTW, TS-1425 is also related to this ticket; both are trying to handle failed
POSTs correctly.




> Failed post results in tunnel buffers being returned to freelist prematurely
> ----------------------------------------------------------------------------
>
>                 Key: TS-2497
>                 URL: https://issues.apache.org/jira/browse/TS-2497
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Core
>            Reporter: Brian Geffon
>            Assignee: Brian Geffon
>             Fix For: 4.2.0
>
>         Attachments: TS-2497.patch, client.js, origin-server.js, repro.js
>
>
> When a post to an origin server fails, either because the server died or because 
> the server returned a response without reading all of the post data, TS 
> will destroy buffers too early.
> because the MIOBuffers are returned to the freelist and only with sufficient 
> load will the race happen causing a crash. Additionally, even if a crash 
> doesn't happen you might have data corruption across post requests from the 
> buffers being used after being returned to the freelist.
> Thanks to Thomas Jackson for help reproducing and resolving this bug.
> An example stack trace follows; we've also seen other crashes in write_avail.
> #0  0x00000000004eff14 in IOBufferBlock::read_avail (this=0x0) at ../iocore/eventsystem/I_IOBuffer.h:362
> #1  0x000000000050d151 in MIOBuffer::append_block_internal (this=0x2aab38001130, b=0x2aab0c037200) at ../iocore/eventsystem/P_IOBuffer.h:946
> #2  0x000000000050d39b in MIOBuffer::append_block (this=0x2aab38001130, asize_index=15) at ../iocore/eventsystem/P_IOBuffer.h:986
> #3  0x000000000050d49b in MIOBuffer::add_block (this=0x2aab38001130) at ../iocore/eventsystem/P_IOBuffer.h:994
> #4  0x000000000055cee2 in MIOBuffer::check_add_block (this=0x2aab38001130) at ../iocore/eventsystem/P_IOBuffer.h:1002
> #5  0x000000000055d115 in MIOBuffer::write_avail (this=0x2aab38001130) at ../iocore/eventsystem/P_IOBuffer.h:1048
> #6  0x00000000006c18f3 in read_from_net (nh=0x2aaafca0d208, vc=0x2aab1c009140, thread=0x2aaafca0a010) at UnixNetVConnection.cc:234
> #7  0x00000000006c37bf in UnixNetVConnection::net_read_io (this=0x2aab1c009140, nh=0x2aaafca0d208, lthread=0x2aaafca0a010) at UnixNetVConnection.cc:816
> #8  0x00000000006be392 in NetHandler::mainNetEvent (this=0x2aaafca0d208, event=5, e=0x271d8e0) at UnixNet.cc:380
> #9  0x00000000004f05c4 in Continuation::handleEvent (this=0x2aaafca0d208, event=5, data=0x271d8e0) at ../iocore/eventsystem/I_Continuation.h:146
> #10 0x00000000006e361e in EThread::process_event (this=0x2aaafca0a010, e=0x271d8e0, calling_code=5) at UnixEThread.cc:142
> #11 0x00000000006e3b13 in EThread::execute (this=0x2aaafca0a010) at UnixEThread.cc:264
> #12 0x00000000006e290b in spawn_thread_internal (a=0x2716400) at Thread.cc:88
> #13 0x0000003372c077e1 in start_thread () from /lib64/libpthread.so.0
> #14 0x00000033728e68ed in clone () from /lib64/libc.so.6


