[
https://issues.apache.org/jira/browse/TS-852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13058974#comment-13058974
]
AdunGaos commented on TS-852:
-----------------------------
We have do lots of work at this issue, found some information about it:
Default:
{code}
proxy.config.dns.lookup_timeout=20
proxy.config.hostdb.ttl_mode=0
proxy.config.hostdb.verify_after=720
{code}
Our dns ttl is 300 seconds, this cause when a hostdb entry timeout, it will
send a dns requst to dns system, ATS CAN NOT DO ANYTHING about this domain's
request until dns system return an answer. BUT a dns request may be timeout in
some status. this cause ATS DON'T WORK a long
time(proxy.config.dns.lookup_timeout=20). In our test, about 3000qps, when this
issue appear, will cause ATS stack so many request, even after dns timeout, a
dns retry answer right, OLD requst take a long time to dequeue from hostdb. so
it seems like hostdb hungup.
DNS requst may timeout because next issue:
Our nameserver is 127.0.0.1, it will answer requst so quickly less than 0.4ms.
When a thread send dns requst, and it not ready for receive answer, the answer
received by anthoer thread, It NOT FOUND correct DNS id from
dnsProcessor.handler, cause this answer droped, until dns timeout, it do a
retry.
Next log is our set proxy.config.dns.lookup_timeout=2:
step 1. send request:
{code}
[Jun 26 03:29:12.714] Server {1103427904} DEBUG: (dns) received query
img.taobaocdn.com type = 1, timeout = ,0
[Jun 26 03:29:12.714] Server {1103427904} DEBUG: (dns) enqueing query
img.taobaocdn.com
[Jun 26 03:29:12.715] Server {1103427904} DEBUG: (dns) adding first to
collapsing queue
[Jun 26 03:29:12.715] Server {1103427904} DEBUG: (dns) send query for
img.taobaocdn.com to fd 51
{code}
step 2. answer received by an other thread but drop it.
{code}
[Jun 26 03:29:12.715] Server {47576670086256} DEBUG: (dns) received packet size
= 157
[Jun 26 03:29:12.715] Server {47576670086256} DEBUG: (dns) primary DNS response
code = 0
[Jun 26 03:29:12.715] Server {47576670086256} DEBUG: (dns) unknown DNS id =
62816
{code}
step 3. thread 1 sent finished after answer has been droped.
{code}
[Jun 26 03:29:12.715] Server {1103427904} DEBUG: (dns) sent qname =
img.taobaocdn.com, id = 62816, nameserver = 0
[Jun 26 03:29:12.715] Server {1103427904} DEBUG: (dns) sent_one:
failover_number for resolver 0 is 1
{code}
step 4. timeout for query, do a retry and get right answer this time.
{code}
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) timeout for query
img.taobaocdn.com
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) marking
img.taobaocdn.com as not-written
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) doing retry for
img.taobaocdn.com
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) send query for
img.taobaocdn.com to fd 51
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) sent qname =
img.taobaocdn.com, id = 39679, nameserver = 0
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) sent_one:
failover_number for resolver 0 is 2
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received packet size
= 157
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) primary DNS response
code = 0
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received cname =
taobao
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received A name =
taobao
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received A =
122.224.194.211
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received A =
122.224.194.212
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received A =
110.75.3.141
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) received A =
110.75.3.142
[Jun 26 03:29:14.710] Server {47576670086256} DEBUG: (dns) SUCCESS result for
img.taobaocdn.com = 122.224.194.211 retry 0
[Jun 26 03:29:14.710] Server {1103427904} DEBUG: (dns) called back continuation
for img.taobaocdn.com
{code}
We do a small patch(change some code's order) about this issue, it seems this
behavior not appeared again. Any one can take a look at this patch, declare
this patch is right or no, or it may has other issue?
P.S. after this patch, TS-833 seems gone. I don't clear how this two issue
seems is related.
> hostdb and dns system hangup
> ----------------------------
>
> Key: TS-852
> URL: https://issues.apache.org/jira/browse/TS-852
> Project: Traffic Server
> Issue Type: Bug
> Components: Core, DNS
> Affects Versions: 3.1.0
> Reporter: Zhao Yongming
> Priority: Critical
> Labels: dns, hostdb
> Fix For: 3.1.0
>
>
> in my testing, all request that need go OS, fails with 30s timeout, and the
> slow query shows data from dns missed:
> {code}
> [Jun 22 15:33:47.183] Server {1106880832} ERROR: [8122411] Slow Request: url:
> http://download.im.alisoft.com/aliim/AliIM6/update/packages//2011-5-4-10-10-40/0/modulesproxy/8203.xml.wwzip
> status: 0 unique id: bytes: 0 fd: 0 client state: 6 server state: 0
> ua_begin: 0.000 ua_read_header_done: 0.000 cache_open_read_begin: 0.000
> cache_open_read_end: 0.000 dns_lookup_begin: 0.000 dns_lookup_end: -1.000
> server_connect: -1.000 server_first_read: -1.000 server_read_header_done:
> -1.000 server_close: -1.000 ua_close: 30.667 sm_finish: 30.667
> [Jun 22 15:33:47.220] Server {1099663680} ERROR: [8122422] Slow Request: url:
> http://img.uu1001.cn/materials/original/2010-11-22/22-48/a302876929a9c40a8272ac439a16ad25e74edf71.png
> status: 0 unique id: bytes: 0 fd: 0 client state: 6 server state: 0
> ua_begin: 0.000 ua_read_header_done: 0.000 cache_open_read_begin: 0.000
> cache_open_read_end: 0.000 dns_lookup_begin: 0.000 dns_lookup_end: -1.000
> server_connect: -1.000 server_first_read: -1.000 server_read_header_done:
> -1.000 server_close: -1.000 ua_close: 30.318 sm_finish: 30.318
> [Jun 22 15:33:47.441] Server {1098611008} ERROR: [8122421] Slow Request: url:
> http://img02.taobaocdn.com/bao/uploaded/i2/T1mp8QXopdXXblNIZ6_061203.jpg_b.jpg
> status: 0 unique id: bytes: 0 fd: 0 client state: 6 server state: 0
> ua_begin: 0.000 ua_read_header_done: 0.000 cache_open_read_begin: 0.000
> cache_open_read_end: 0.000 dns_lookup_begin: 0.000 dns_lookup_end: -1.000
> server_connect: -1.000 server_first_read: -1.000 server_read_header_done:
> -1.000 server_close: -1.000 ua_close: 30.597 sm_finish: 30.597
> [Jun 22 15:33:47.441] Server {1098611008} ERROR: [8122409] Slow Request: url:
> http://img04.taobaocdn.com/tps/i4/T1EM9gXlttXXXXXXXX-440-135.jpg status: 0
> unique id: bytes: 0 fd: 0 client state: 6 server state: 0 ua_begin: 0.000
> ua_read_header_done: 0.000 cache_open_read_begin: 0.000 cache_open_read_end:
> 0.000 dns_lookup_begin: 0.000 dns_lookup_end: -1.000 server_connect: -1.000
> server_first_read: -1.000 server_read_header_done: -1.000 server_close:
> -1.000 ua_close: 30.948 sm_finish: 30.948
> {code}
> all http SM show from {http} in DNS_LOOKUP
> and traffic.out show nothing with error, when I enable debug on
> hostdb.*|dns.*:
> {code}
> [Jun 22 15:27:42.391] Server {1108986176} DEBUG: (hostdb) probe
> img03.taobaocdn.com DB357D60B8247DC7 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.391] Server {1108986176} DEBUG: (hostdb) timeout 64204
> 1308663404 300
> [Jun 22 15:27:42.391] Server {1108986176} DEBUG: (hostdb) delaying force 0
> answer for img03.taobaocdn.com [timeout 0]
> [Jun 22 15:27:42.411] Server {1108986176} DEBUG: (hostdb) probe
> img03.taobaocdn.com DB357D60B8247DC7 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.411] Server {1108986176} DEBUG: (hostdb) timeout 64204
> 1308663404 300
> [Jun 22 15:27:42.411] Server {1108986176} DEBUG: (hostdb) DNS
> img03.taobaocdn.com
> [Jun 22 15:27:42.411] Server {1108986176} DEBUG: (hostdb) enqueuing
> additional request
> [Jun 22 15:27:42.416] Server {47177168876656} DEBUG: (hostdb) probe 127.0.0.1
> E9B7563422C93608 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.416] Server {47177168876656} DEBUG: (hostdb) immediate
> answer for 127.0.0.1
> [Jun 22 15:27:42.422] Server {1098611008} DEBUG: (hostdb) probe
> img08.taobaocdn.com 945198AE9AE37481 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.422] Server {1098611008} DEBUG: (hostdb) timeout 64281
> 1308663327 300
> [Jun 22 15:27:42.422] Server {1098611008} DEBUG: (hostdb) delaying force 0
> answer for img08.taobaocdn.com [timeout 0]
> [Jun 22 15:27:42.441] Server {1098611008} DEBUG: (hostdb) probe
> img08.taobaocdn.com 945198AE9AE37481 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.441] Server {1098611008} DEBUG: (hostdb) timeout 64281
> 1308663327 300
> [Jun 22 15:27:42.441] Server {1098611008} DEBUG: (hostdb) DNS
> img08.taobaocdn.com
> [Jun 22 15:27:42.441] Server {1098611008} DEBUG: (hostdb) enqueuing
> additional request
> [Jun 22 15:27:42.470] Server {1099663680} DEBUG: (hostdb) probe 127.0.0.1
> E9B7563422C93608 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.470] Server {1099663680} DEBUG: (hostdb) immediate answer
> for 127.0.0.1
> [Jun 22 15:27:42.490] Server {1106880832} DEBUG: (hostdb) probe 127.0.0.1
> E9B7563422C93608 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.490] Server {1106880832} DEBUG: (hostdb) immediate answer
> for 127.0.0.1
> [Jun 22 15:27:42.594] Server {1108986176} DEBUG: (hostdb) probe 127.0.0.1
> E9B7563422C93608 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.594] Server {1108986176} DEBUG: (hostdb) immediate answer
> for 127.0.0.1
> [Jun 22 15:27:42.818] Server {1108986176} DEBUG: (hostdb) probe 127.0.0.1
> E9B7563422C93608 1 [ignore_timeout = 0]
> [Jun 22 15:27:42.818] Server {1108986176} DEBUG: (hostdb) immediate answer
> for 127.0.0.1
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing
> additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing
> additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing
> additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing
> additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing
> additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing
> additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing
> additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing
> additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) dequeuing
> additional request
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) probe
> img.taobao.com 55D857190294CA66 1 [ignore_timeout = 0]
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) timeout 64134
> 1308663475 300
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) DNS img.taobao.com
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (dns) received query
> img.taobao.com type = 1, timeout = 0
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (dns) enqueing query
> img.taobao.com
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (dns) adding first to
> collapsing queue
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) probe
> img.taobao.com 55D857190294CA66 1 [ignore_timeout = 0]
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) timeout 64134
> 1308663475 300
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) DNS img.taobao.com
> [Jun 22 15:27:43.014] Server {1099663680} DEBUG: (hostdb) enqueuing
> additional request
> {code}
> this is a critical fail for me!
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira