[ https://issues.apache.org/jira/browse/CASSANDRA-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13985821#comment-13985821 ]

Joshua McKenzie edited comment on CASSANDRA-6890 at 4/30/14 5:49 PM:
---------------------------------------------------------------------

Running cql3 native with snappy compression and a mixed load at a 1-to-1 ratio 
looks to have normalized a lot of the performance differential I was seeing.  
Using linux mmap as the baseline:

Raw op/s:
| |windows buffered|windows mmap|linux buffered|linux mmap|
|  4 threadCount|2236|2171|1953|2111|
|  8 threadCount|4716|4673|3955|4300|
| 16 threadCount|7605|7529|6795|7465|
| 24 threadCount|8662|9231|8341|8819|
| 36 threadCount|13907|13147|13237|14451|
| 54 threadCount|24039|24817|24177|26073|
| 81 threadCount|39016|43673|34154|40929|
|121 threadCount|40494|49513|42658|48313|
|181 threadCount|53189|53039|49691|52885|
|271 threadCount|53447|55354|54842|58779|
|406 threadCount|54853|54295|60108|64675|
|609 threadCount|60067|56145|61823|70885|
|913 threadCount|57333|58483|60763|70398|

% Comparison (each column relative to linux mmap at the same threadCount; see 
the sketch after the table):
| |windows buffered|windows mmap|linux buffered|linux mmap|
|  4 threadCount|105.92%|102.84%|92.52%|100.00%|
|  8 threadCount|109.67%|108.67%|91.98%|100.00%|
| 16 threadCount|101.88%|100.86%|91.02%|100.00%|
| 24 threadCount|98.22%|104.67%|94.58%|100.00%|
| 36 threadCount|96.24%|90.98%|91.60%|100.00%|
| 54 threadCount|92.20%|95.18%|92.73%|100.00%|
| 81 threadCount|95.33%|106.70%|83.45%|100.00%|
|121 threadCount|83.82%|102.48%|88.30%|100.00%|
|181 threadCount|100.57%|100.29%|93.96%|100.00%|
|271 threadCount|90.93%|94.17%|93.30%|100.00%|
|406 threadCount|84.81%|83.95%|92.94%|100.00%|
|609 threadCount|84.74%|79.21%|87.22%|100.00%|
|913 threadCount|81.44%|83.07%|86.31%|100.00%|
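
For anyone double-checking the math, here's a minimal sketch of how the 
percentages above are derived: raw op/s for each configuration divided by the 
linux mmap op/s at the same threadCount.  The sample values and class name 
below are illustrative only, copied from the 121 threadCount row.

{code:java}
// Minimal sketch of the normalization used in the table above: raw op/s for each
// configuration divided by the linux mmap op/s at the same threadCount.
// Sample values are taken from the 121 threadCount row.
public class OpRateComparison
{
    public static void main(String[] args)
    {
        String[] labels = { "windows buffered", "windows mmap", "linux buffered", "linux mmap" };
        double[] opsAt121Threads = { 40494, 49513, 42658, 48313 };
        double linuxMmapBaseline = opsAt121Threads[3];

        for (int i = 0; i < labels.length; i++)
        {
            double pct = 100.0 * opsAt121Threads[i] / linuxMmapBaseline;
            System.out.printf("%-16s %7.2f%%%n", labels[i], pct);
        }
        // Prints 83.82%, 102.48%, 88.30%, 100.00% -- matching the 121 threadCount row.
    }
}
{code}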

As Benedict indicated, an in-process page cache should make the debate between 
these two paths moot.  The results above are quite close to the 10% threshold 
you indicated, Jonathan; I'd be comfortable normalizing the system on buffered 
I/O leading up to 3.0 to give us a single read path to migrate to an 
in-process page cache.  I certainly don't see a need for us to keep the 
mmap'ed path on Windows, as there doesn't appear to be a performance 
differential when using a more representative workload on cql3.
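
For anyone skimming the ticket, here's a rough sketch of the two read paths 
being discussed, using plain JDK NIO rather than Cassandra's actual reader 
classes (names and structure here are illustrative only): the buffered path 
copies bytes from the file into a buffer we own, while the mmap'ed path maps 
the file region and reads through the OS page cache.

{code:java}
// Rough illustration only -- plain JDK NIO, not Cassandra's actual reader classes.
// Assumes the file passed as args[0] is at least 4 KB long.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ReadPathSketch
{
    // Buffered path: explicitly copy bytes from the file into a buffer we own.
    static ByteBuffer bufferedRead(FileChannel channel, long position, int length) throws IOException
    {
        ByteBuffer buffer = ByteBuffer.allocate(length);
        channel.read(buffer, position);
        buffer.flip();
        return buffer;
    }

    // Memory-mapped path: map the region and read through the OS page cache.
    static MappedByteBuffer mmapRead(FileChannel channel, long position, int length) throws IOException
    {
        return channel.map(FileChannel.MapMode.READ_ONLY, position, length);
    }

    public static void main(String[] args) throws IOException
    {
        try (FileChannel channel = FileChannel.open(Paths.get(args[0]), StandardOpenOption.READ))
        {
            System.out.println(bufferedRead(channel, 0, 4096).remaining());
            System.out.println(mmapRead(channel, 0, 4096).remaining());
        }
    }
}
{code}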

As an aside, do we have a documented set of suggestions on how people should 
approach stress-testing Cassandra, or perhaps a set of performance regression 
tests we run against releases?  Nothing beats specialized expertise in tuning 
the stress workload to your expected usage patterns, but it might help to give 
people a baseline and a starting point for their own testing.

Pavel: I did record perf runs of both the buffered and memory-mapped paths on 
linux, but given how close the results above are I don't know how much value 
we'll be able to pull from them.  I can attach them to the ticket if you're 
still interested.


> Standardize on a single read path
> ---------------------------------
>
>                 Key: CASSANDRA-6890
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6890
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Joshua McKenzie
>            Assignee: Joshua McKenzie
>              Labels: performance
>             Fix For: 3.0
>
>         Attachments: mmap_gc.jpg, mmap_jstat.txt, mmap_perf.txt, 
> nommap_gc.jpg, nommap_jstat.txt
>
>
> Since we actively unmap unreferenced SSTR's and also copy data out of those 
> readers on the read path, the current memory mapped i/o is a lot of 
> complexity for very little payoff.  Clean out the mmap'ed i/o on the read 
> path.



