[
https://issues.apache.org/jira/browse/TS-2497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14289206#comment-14289206
]
Feifei Cai edited comment on TS-2497 at 1/23/15 12:53 PM:
----------------------------------------------------------
A memory leak was noticed on our production hosts. It appears to be related to
handling 5xx responses from the origin server.
The dump info below is from one host with ~70% POST requests. I enabled the
memory dump ({{proxy.config.dump_mem_info_frequency}}) and memory tracking
({{proxy.config.res_track_memory}}).
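For reference, both settings go into records.config roughly like this (the
60-second dump interval is only an example value, not necessarily what we run
in production):
{quote}CONFIG proxy.config.dump_mem_info_frequency INT 60
CONFIG proxy.config.res_track_memory INT 1{quote}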
*traffic.out:*
{noformat}
allocated | in-use | type size | free list name
--------------------|--------------------|------------|----------------------------------
0 | 0 | 2097152 | memory/ioBufAllocator[14]
0 | 0 | 1048576 | memory/ioBufAllocator[13]
0 | 0 | 524288 | memory/ioBufAllocator[12]
0 | 0 | 262144 | memory/ioBufAllocator[11]
0 | 0 | 131072 | memory/ioBufAllocator[10]
0 | 0 | 65536 | memory/ioBufAllocator[9]
1266679808 | 1262354432 | 32768 | memory/ioBufAllocator[8]
600309760 | 599703552 | 16384 | memory/ioBufAllocator[7]
395051008 | 391086080 | 8192 | memory/ioBufAllocator[6]
229113856 | 224432128 | 4096 | memory/ioBufAllocator[5]
342622208 | 342503424 | 2048 | memory/ioBufAllocator[4]
245104640 | 245042176 | 1024 | memory/ioBufAllocator[3]
2228224 | 2176512 | 512 | memory/ioBufAllocator[2]
622592 | 607232 | 256 | memory/ioBufAllocator[1]
2375680 | 2370176 | 128 | memory/ioBufAllocator[0]
Location | Size In-use
---------------------------------------------------+------------------------
memory/IOBuffer/ProtocolProbeSessionAccept.cc:39 | 66768896
memory/IOBuffer/HttpClientSession.cc:230 | 0
memory/IOBuffer/HttpSM.cc:3314 | 0
memory/IOBuffer/HttpSM.cc:5349 | 3003506816
memory/IOBuffer/HttpSM.cc:5668 | 0
memory/IOBuffer/HttpSM.cc:5874 | 0
memory/IOBuffer/HttpSM.cc:5976 | 0
memory/IOBuffer/HttpSM.cc:6267 | 0
memory/IOBuffer/HttpServerSession.cc:87 | 0
memory/IOBuffer/HttpTunnel.cc:95 | 0
memory/IOBuffer/HttpTunnel.cc:100 | 0
TOTAL | 3070275712
{noformat}
I took [~shaunmcginnity]'s node.js scripts as a reference, made some changes,
and reproduced the memory leak in my local environment.
# [origin-server.js|https://issues.apache.org/jira/secure/attachment/12694145/origin-server.js]
This origin server responds with a 503 as soon as it has received more than a
single byte, so in most cases the POST does not complete. I changed
[~shaunmcginnity]'s code so that the origin server sends a response back to
ATS (rather than just disconnecting), which makes ATS hit a different code
path. A sketch of both scripts follows this list.
# [client.js|https://issues.apache.org/jira/secure/attachment/12694146/client.js]
This script creates a new client every second, and each client tries to POST
32K bytes of data.
# ATS
*remap.config*: remap all to local port 5000
{quote}map / http://127.0.0.1:5000{quote}
*records.config*: listen on 80
{quote}CONFIG proxy.config.http.server_ports STRING 80{quote}
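Here is a minimal sketch of what my modified origin-server.js does, based on
the description above; it is not the actual attachment, and the details (port
handling, exact response) may differ:
{code}
// origin-server.js (sketch): respond with a 503 as soon as more than one
// byte of the request has arrived, without reading the rest of the POST,
// so the server "answers" while the POST is still incomplete.
var net = require('net');

net.createServer(function (socket) {
  var received = 0;
  var responded = false;
  socket.on('data', function (chunk) {
    received += chunk.length;
    if (received > 1 && !responded) {
      responded = true;
      // Respond instead of just disconnecting, so ATS takes the
      // "server already answered" branch in handle_post_failure().
      socket.end('HTTP/1.1 503 Service Unavailable\r\n' +
                 'Content-Length: 0\r\n' +
                 'Connection: close\r\n' +
                 '\r\n');
    }
  });
  socket.on('error', function () {}); // ignore resets from ATS
}).listen(5000, '127.0.0.1');
{code}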
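And a similar sketch of client.js (again an approximation, not the actual
attachment): start a new client every second, each trying to POST 32K through
ATS on port 80. The SIZE variable is my own name for illustration; lowering it
exercises the smaller ioBufAllocator indexes mentioned below.
{code}
// client.js (sketch): one new client per second, each POSTing SIZE bytes
// through ATS listening on port 80.
var http = require('http');

var SIZE = 32 * 1024; // lower this to watch ioBufAllocator[0-7] grow instead

function post() {
  var body = new Buffer(SIZE);
  body.fill('x');
  var req = http.request({
    host: '127.0.0.1',
    port: 80,
    path: '/',
    method: 'POST',
    headers: { 'Content-Length': body.length }
  }, function (res) {
    res.resume(); // drain whatever comes back (usually the 503)
  });
  req.on('error', function () {}); // the origin aborts the POST; ignore
  req.end(body);
}

setInterval(post, 1000);
{code}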
Then we get dump info like the following, and the in-use count of MIOBuffer
with index=8 (size=32K) increases by 1 per second.
{noformat}
allocated | in-use | type size | free list name
--------------------|--------------------|------------|----------------------------------
1048576 | 32768 | 32768 | memory/ioBufAllocator[8]
{noformat}
We can also change the "Content-Length" in client.js to a smaller size, and
the MIOBuffer with the corresponding index (0-7) increases instead.
I added the simple patch below to prevent the memory leak in the case above,
similar to the last commit, and it has been verified on one test host.
free.diff
{code}
diff --git a/proxy/http/HttpSM.cc b/proxy/http/HttpSM.cc
index 932ef97..123b97a 100644
--- a/proxy/http/HttpSM.cc
+++ b/proxy/http/HttpSM.cc
@@ -5074,6 +5074,7 @@ HttpSM::handle_post_failure()
t_state.current.server->keep_alive = HTTP_NO_KEEPALIVE;
if (server_buffer_reader->read_avail() > 0) {
+ tunnel.deallocate_buffers();
tunnel.reset();
// There's data from the server so try to read the header
setup_server_read_response_header();
{code}
*traffic.out*
{noformat}
allocated | in-use | type size | free list name
--------------------|--------------------|------------|----------------------------------
0 | 0 | 2097152 | memory/ioBufAllocator[14]
0 | 0 | 1048576 | memory/ioBufAllocator[13]
0 | 0 | 524288 | memory/ioBufAllocator[12]
0 | 0 | 262144 | memory/ioBufAllocator[11]
0 | 0 | 131072 | memory/ioBufAllocator[10]
0 | 0 | 65536 | memory/ioBufAllocator[9]
6291456 | 1474560 | 32768 | memory/ioBufAllocator[8]
524288 | 32768 | 16384 | memory/ioBufAllocator[7]
4456448 | 1949696 | 8192 | memory/ioBufAllocator[6]
6291456 | 2674688 | 4096 | memory/ioBufAllocator[5]
262144 | 4096 | 2048 | memory/ioBufAllocator[4]
131072 | 8192 | 1024 | memory/ioBufAllocator[3]
65536 | 0 | 512 | memory/ioBufAllocator[2]
32768 | 512 | 256 | memory/ioBufAllocator[1]
16384 | 1280 | 128 | memory/ioBufAllocator[0]
{noformat}
However, as I understand it, this patch essentially reverts TS-2497. So it is
not the right fix, and it brings us back to the starting point: a potential
crash. :(
I'm not yet clear about the root cause of the "prematurely freed buffer"
issue. I think we cannot rely on the {{HttpTunnel::deallocate_buffers()}} call
in {{HttpSM::kill_this()}} to free the {{post_buffer}} we allocated in
{{HttpSM::do_setup_post_tunnel}}. In the test case above, the server does not
receive all of the POST data, but it responds to ATS (mostly with a 5xx)
instead of disconnecting, so we hit the {{if}} condition in
{{HttpSM::handle_post_failure()}}. ATS then transfers the 5xx response back to
the client; it resets the tunnel and reuses it, but it does not reuse the
buffer, it allocates a new one when needed. So when HttpSM finally calls
kill_this() to clean up the tunnel, it frees the buffer allocated in
{{HttpSM::setup_server_transfer()}}, not the {{post_buffer}} from
{{HttpSM::do_setup_post_tunnel}}.
We can enable {{proxy.config.res_track_memory}} to track this, or debug and
break on each call of {{HttpTunnel::deallocate_buffers()}} to look at the
MIOBuffer's location.
BTW, TS-1425 is also related to this ticket; both are about handling a failed
POST correctly.
> Failed post results in tunnel buffers being returned to freelist prematurely
> ----------------------------------------------------------------------------
>
> Key: TS-2497
> URL: https://issues.apache.org/jira/browse/TS-2497
> Project: Traffic Server
> Issue Type: Bug
> Components: Core
> Reporter: Brian Geffon
> Assignee: Brian Geffon
> Fix For: 4.2.0
>
> Attachments: TS-2497.patch, client.js, origin-server.js, repro.js
>
>
> When a post fails to an origin server either the server died or the server
> returned a response without reading all of the post data, in either case, TS
> will destroy buffers too early. This normally does not result in a crash
> because the MIOBuffers are returned to the freelist and only with sufficient
> load will the race happen causing a crash. Additionally, even if a crash
> doesn't happen you might have data corruption across post requests from the
> buffers being used after being returned to the freelist.
> Thanks to Thomas Jackson for help reproducing and resolving this bug.
> An example stack trace, while we've seen other crashes in write_avail too.
> #0 0x00000000004eff14 in IOBufferBlock::read_avail (this=0x0) at ../iocore/eventsystem/I_IOBuffer.h:362
> #1 0x000000000050d151 in MIOBuffer::append_block_internal (this=0x2aab38001130, b=0x2aab0c037200) at ../iocore/eventsystem/P_IOBuffer.h:946
> #2 0x000000000050d39b in MIOBuffer::append_block (this=0x2aab38001130, asize_index=15) at ../iocore/eventsystem/P_IOBuffer.h:986
> #3 0x000000000050d49b in MIOBuffer::add_block (this=0x2aab38001130) at ../iocore/eventsystem/P_IOBuffer.h:994
> #4 0x000000000055cee2 in MIOBuffer::check_add_block (this=0x2aab38001130) at ../iocore/eventsystem/P_IOBuffer.h:1002
> #5 0x000000000055d115 in MIOBuffer::write_avail (this=0x2aab38001130) at ../iocore/eventsystem/P_IOBuffer.h:1048
> #6 0x00000000006c18f3 in read_from_net (nh=0x2aaafca0d208, vc=0x2aab1c009140, thread=0x2aaafca0a010) at UnixNetVConnection.cc:234
> #7 0x00000000006c37bf in UnixNetVConnection::net_read_io (this=0x2aab1c009140, nh=0x2aaafca0d208, lthread=0x2aaafca0a010) at UnixNetVConnection.cc:816
> #8 0x00000000006be392 in NetHandler::mainNetEvent (this=0x2aaafca0d208, event=5, e=0x271d8e0) at UnixNet.cc:380
> #9 0x00000000004f05c4 in Continuation::handleEvent (this=0x2aaafca0d208, event=5, data=0x271d8e0) at ../iocore/eventsystem/I_Continuation.h:146
> #10 0x00000000006e361e in EThread::process_event (this=0x2aaafca0a010, e=0x271d8e0, calling_code=5) at UnixEThread.cc:142
> #11 0x00000000006e3b13 in EThread::execute (this=0x2aaafca0a010) at UnixEThread.cc:264
> #12 0x00000000006e290b in spawn_thread_internal (a=0x2716400) at Thread.cc:88
> #13 0x0000003372c077e1 in start_thread () from /lib64/libpthread.so.0
> #14 0x00000033728e68ed in clone () from /lib64/libc.so.6