[ https://issues.apache.org/jira/browse/TS-3395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luca Bruno updated TS-3395:
---------------------------
    Description: 
I'm doing some tests and I've noticed that the hit ratio drops with more than 
300 simultaneous HTTP connections.

The cache is on a 500GB raw disk and is not full, so there is no eviction. The 
RAM cache is disabled.

The test is done with web-polygraph. Content size varies uniformly from 5kb to 
20kb, the expected hit ratio is 60%, there are 2000 HTTP connections, and 
documents expire after months. There's no Vary header.

!http://i.imgur.com/Zxlhgnf.png!

Then I thought it could be a problem with polygraph, so I wrote my own 
client/server test code, which works fine with other cache servers. I register 
a hit if I get either cR or cH in the headers.
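
A minimal sketch of that hit check in Go, for illustration only; this is not 
the actual test client. It assumes the cR/cH codes are read from the Via 
response header. The function name isCacheHit and the sample header values are 
made up.

```go
package main

import (
	"fmt"
	"strings"
)

// isCacheHit reports whether a Via header value contains one of the two
// cache-lookup codes counted as a hit in this test: cH or cR.
// Assumption: a plain substring match is good enough here; a hostname in
// the Via value that happens to contain "cH" or "cR" would be miscounted.
func isCacheHit(via string) bool {
	return strings.Contains(via, "cH") || strings.Contains(via, "cR")
}

func main() {
	// Hypothetical Via code strings, not taken from a real capture.
	samples := []string{
		"[uScHs f p eN:t cCHi p s ]",
		"[uScMsSf pSeN:t cCMi p sS]",
	}
	for _, via := range samples {
		fmt.Printf("%q hit=%v\n", via, isCacheHit(via))
	}
}
```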

{noformat}
2015/02/19 12:38:28 Starting 1000000 requests
2015/02/19 12:37:58 Elapsed: 3m51.23552164s
2015/02/19 12:37:58 Total average: 231.235µs/req, 4324.60req/s
2015/02/19 12:37:58 Average size: 12.50kb/req
2015/02/19 12:37:58 Bytes read: 12498412.45kb, 54050.57kb/s
2015/02/19 12:37:58 Errors: 0
2015/02/19 12:37:58 Offered Hit ratio: 59.95%
2015/02/19 12:37:58 Measured Hit ratio: 37.20%
2015/02/19 12:37:58 Hit bytes: 4649000609
2015/02/19 12:37:58 Hit success: 599476/599476 (100.00%), 469.840902ms/req
2015/02/19 12:37:58 Miss success: 400524/400524 (100.00%), 336.301464ms/req
{noformat}

So the results are similar: 37.20% on average. Then I thought the problem 
could be in how I'm testing, so I tried the nginx cache. It achieves a 60% hit 
ratio, but its request rate is very slow compared to ATS for obvious reasons.

Then I wanted to check whether the hit ratio also dropped with 200 connections 
and a longer test time, but no, it's fine:

!http://i.imgur.com/oMHscuf.png!

So I guess it's not a problem with my tests.

Then, by debugging the test server, I realized that the same URL was being 
requested more than once. Out of 1000000 requests, 78600 URLs were requested at 
least twice. One URL was even requested 9 times. The repeated requests for a 
URL are not close to each other: more than 30 seconds can pass between one 
request and the next for the same URL.

I also tweaked the following parameters:

{noformat}
CONFIG proxy.config.http.cache.fuzz.time INT 0
CONFIG proxy.config.http.cache.fuzz.min_time INT 0
CONFIG proxy.config.http.cache.fuzz.probability FLOAT 0.000000
CONFIG proxy.config.http.cache.max_open_read_retries INT 4
CONFIG proxy.config.http.cache.open_read_retry_time INT 500
{noformat}

The result with polygraph is similar:

!http://i.imgur.com/YgOndhY.png!

I tweaked the read-while-writer option and still got similar results.

Then I enabled 1GB of RAM cache. The hit ratio is slightly better at the 
beginning, but then it drops:

!http://i.imgur.com/dFTJI16.png!

traffic_top says 25% ram hit, 37% fresh, 63% cold.

So, given that it doesn't seem to be a concurrency problem when requesting the 
URL from the origin server, could it be a problem of concurrent write access to 
the cache, so that some pages are not cached at all? The traffic_top fresh 
percentage also makes me think it could be a problem in writing to the cache.

I'm not sure I explained the problem clearly, so ask me for further information 
if needed. In summary: the hit ratio drops with a high number of connections, 
and the problem seems related to pages that are not written to the cache.

This is a related issue: 
http://mail-archives.apache.org/mod_mbox/trafficserver-users/201301.mbox/%3ccd28cb1f.1f44a%25peter.wa...@email.disney.com%3E

Also this: 
http://apache-traffic-server.24303.n7.nabble.com/why-my-proxy-node-cache-hit-ratio-drops-td928.html


> Hit ratio drops with high concurrency
> -------------------------------------
>
>                 Key: TS-3395
>                 URL: https://issues.apache.org/jira/browse/TS-3395
>             Project: Traffic Server
>          Issue Type: Bug
>          Components: Cache
>            Reporter: Luca Bruno
>



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
