Overall I must say I am not a performance tuning/measuring expert and 
clearly have lots of things to learn ;-) BTW, can you point me to any 
performance setup/procedures/docs that you guys used with OSv?
I also feel I tried to kill too many birds with one stone. Ideally I 
should have divided the whole thing into three categories:
- OSv on firecracker vs QEMU
- OSv vs Docker
- OSv vs Linux guest

On Tuesday, March 26, 2019 at 8:32:00 PM UTC-4, דור לאור wrote:
>
> While the performance numbers indicate something, a MacBook is a horrible 
> environment for performance testing. There are effects of other desktop 
> apps, hyperthreading, etc.
>
Well, that is what I have available in my home lab :-) I understand you are 
suggesting that apps running on the MacBook might affect and skew the 
results. I made sure the only apps open were one or two terminal windows. I 
also had mpstat open, and most of the time the CPUs were idle when tests 
were not running. But I get your point that ideally I should use a proper 
headless server machine. I also get the effect of hyperthreading - is there 
a way to switch it off in Linux with some kind of boot parameter? 
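(From a quick search it seems recent Linux kernels support both a boot 
parameter and a runtime switch for this, though I have not tried either yet:

   nosmt                                                  # kernel command line
   echo off | sudo tee /sys/devices/system/cpu/smt/control   # at runtime
)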

> Also a 1 Gbps network can be a bottleneck. 
>
Very likely; I have been suspecting the same thing. 

> Every benchmark case should have a matching performance 
> analysis and point to the bottleneck reason - cpu/networking/context 
> switching/locking/filesystem/..
>
To figure this out I guess I would need to use OSv's tracing capability 
- https://github.com/cloudius-systems/osv/wiki/Trace-analysis-using-trace.py 
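If I read that wiki page right, the workflow is roughly: boot OSv with the 
relevant tracepoints enabled, then extract and summarize them from the host 
- something like this, with the tracepoint pattern being just a guess on my 
part:

   scripts/run.py --trace=net\* --trace-backtrace ...
   scripts/trace.py extract
   scripts/trace.py summary
   scripts/trace.py list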

> Just a hyperthread vs a different thread in another core is a very 
> significant change.
> Need to pin the qemu threads in the host to the right physical threads.
>
I was not even aware that one can pin to specific CPUs. What parameters 
do I pass to qemu?
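From a quick look it seems plain qemu does not take a pinning parameter 
(libvirt exposes vcpupin for this); apparently one pins the vCPU threads 
from the host instead, roughly:

   # find the vCPU thread ids with "info cpus" in the qemu monitor, then:
   sudo taskset -pc 0 <vcpu0-thread-id>
   sudo taskset -pc 2 <vcpu1-thread-id>

(the thread ids above being placeholders)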

>
> Better to run on a good physical server (like i3.metal on AWS or similar; 
> could be smaller, but not 2 cores) and track all the metrics appropriately. 
> Best is to isolate workloads (and make sure they scale linearly too) in 
> terms of cpu/mem/net/disk, and only then show how a more complex workload 
> performs. 
>
Cannot afford $5 per hour ;-) Unless I have a fully automated test suite. 

My dream would be to have an automated process I could trigger with a 
single click of a button that would:
1) Use a CloudFormation template to create a VPC with all the components of 
the test environment.
2) Automatically start each instance under test and a corresponding test 
client.
3) Automatically collect all test results (both wrk output and possibly 
tracing data) and put them somewhere in S3. 

Finally, if I had a suite of visualization tools that could generate 
whatever graphs I need for the analysis, it would save soooooo much time. 
If a whole run took under an hour => then I could pay 5 bucks for it ;-)
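
With the aws CLI I imagine the skeleton would look roughly like this (the 
stack, template and bucket names are made up):

   aws cloudformation create-stack --stack-name osv-bench \
       --template-body file://bench-vpc.yaml
   aws cloudformation wait stack-create-complete --stack-name osv-bench
   # ... start the guests and run the wrk test scripts against them ...
   aws s3 cp results/ s3://osv-bench-results/$(date +%F)/ --recursive
   aws cloudformation delete-stack --stack-name osv-bench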

But it takes time to build one ;-)

>
> On Tue, Mar 26, 2019 at 3:29 PM Waldek Kozaczuk <[email protected]> 
> wrote:
>
>> Last week I spent some time investigating OSv performance and comparing 
>> it to Docker and Linux guests. To that end I adopted the 
>> "unikernels-v-containers" repo by Tom Goethals and extended it with 2 new 
>> apps (Rust and Node.js) and new scripts to build and deploy OSv apps on 
>> QEMU/KVM - https://github.com/wkozaczuk/unikernels-v-containers. So as 
>> you can see my focus was on OSv on QEMU/KVM and firecracker vs Linux on 
>> firecracker vs Docker, whereas Tom's paper compared OSv on Xen vs Docker 
>> (the details of that discussion and the link to the paper can be found 
>> here - https://groups.google.com/forum/#!topic/osv-dev/lhkqFfzbHwk).
>>
>> Specifically I wanted to compare networking performance in terms of the 
>> number of REST API requests per second processed by a typical microservice 
>> app implemented in Rust (built using hyper), Golang and Java (built using 
>> vertx.io), running on the following:
>>
>>    - OSv on QEMU/KVM
>>    - OSv on firecracker
>>    - Docker container
>>    - Linux on firecracker
>>
>> Each app in essence implements a simple todo REST API returning a JSON 
>> payload 100-200 characters long (for example, see the Java one - 
>> https://github.com/wkozaczuk/unikernels-v-containers/blob/master/restapi/java-osv/src/main/java/rest/SimpleREST.java).
>> The source code of all the apps is under this subtree - 
>> https://github.com/wkozaczuk/unikernels-v-containers/blob/master/restapi. 
>> One thing to note is that each request always returns the same payload 
>> (I wonder if that causes the response to be cached and affects the 
>> results).
>>
>> The test setup looked like this:
>>
>> *Host:*
>>
>>    - MacBook Pro with a 4-core Intel i7 CPU with hyperthreading (8 cpus 
>>    reported by lscpu), 16GB of RAM, and Ubuntu 18.10 on it 
>>    - firecracker 0.15.0
>>    - QEMU 2.12.0
>>
>>
>> *Client machine:*
>>
>>    - similar to the one above, with wrk as the test client firing requests 
>>    using 10 threads and 100 open connections for 30 seconds, in 3 series 
>>    one after another (please see this test script - 
>>    
>> https://github.com/wkozaczuk/unikernels-v-containers/blob/master/test-restapi-with-wrk.sh
>>    ).
>>    - wrk by default uses Keep-Alive for HTTP connections, so TCP 
>>    handshake overhead is minimal 
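>>
>> (Given the parameters above, the wrk invocation in that script boils down 
>> to essentially:
>>
>>    wrk -t10 -c100 -d30s http://192.168.1.73:8080/todos
>>
>> with the target URL taken from the sample output further below.)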
>>
>> The host and client machines were connected directly to a 1 Gbit Ethernet 
>> switch, and the host exposed the guest IP using a bridged TAP NIC (please 
>> see the script used - 
>> https://raw.githubusercontent.com/cloudius-systems/osv/master/scripts/setup-external-bridge.sh
>> ).
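>>
>> (For anyone reproducing this without that script, a typical bridged TAP 
>> setup on the host looks roughly like this - br0/tap0/eth0 being example 
>> names, and the linked script being the authoritative version:
>>
>>    ip link add name br0 type bridge   # bridge shared with the physical NIC
>>    ip link set eth0 master br0
>>    ip tuntap add dev tap0 mode tap    # TAP device handed to the guest
>>    ip link set tap0 master br0
>>    ip link set br0 up && ip link set tap0 up
>> )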
>>
>> You can find the scripts to start the applications on OSv and docker here - 
>> https://github.com/wkozaczuk/unikernels-v-containers (run* scripts). 
>> Please note the --cpuset-cpus parameter used in the docker script to limit 
>> the number of CPUs.
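>>
>> (i.e. something along the lines of "docker run --cpuset-cpus 0-3 ..." to 
>> restrict a container to 4 CPUs.)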
>>
>> You can find the detailed results under 
>> https://github.com/wkozaczuk/unikernels-v-containers/tree/master/test_results/remote
>> .
>>
>> Here are just the requests-per-second numbers (full example - 
>> https://raw.githubusercontent.com/wkozaczuk/unikernels-v-containers/master/test_results/remote/docker/rust_docker_4_cpu.wrk
>> )
>>
>> OSv on QEMU
>> *Golang*
>> *1 CPU*
>> Requests/sec:  24313.06
>> Requests/sec:  23874.74
>> Requests/sec:  23300.26
>>
>> *2 CPUs*
>> Requests/sec:  37089.26
>> Requests/sec:  35475.22
>> Requests/sec:  33581.87
>>
>> *4 CPUs*
>> Requests/sec:  42747.11
>> Requests/sec:  43057.99
>> Requests/sec:  42346.27
>>
>> *Java*
>> *1 CPU*
>> Requests/sec:  41049.41
>> Requests/sec:  43622.81
>> Requests/sec:  44777.60
>> *2 CPUs*
>> Requests/sec:  46245.95
>> Requests/sec:  45746.48
>> Requests/sec:  46224.42
>> *4 CPUs*
>> Requests/sec:  48128.33
>> Requests/sec:  45467.53
>> Requests/sec:  45776.45
>>
>> *Rust*
>>
>> *1 CPU*
>> Requests/sec:  43455.34
>> Requests/sec:  43927.73
>> Requests/sec:  41100.07
>>
>> *2 CPUs*
>> Requests/sec:  49120.31
>> Requests/sec:  49298.28
>> Requests/sec:  48076.98
>> *4 CPUs*
>> Requests/sec:  51477.57
>> Requests/sec:  51587.92
>> Requests/sec:  49118.68
>>
>> OSv on firecracker
>> *Golang*
>>
>> *1 cpu*
>> Requests/sec:  16721.56
>> Requests/sec:  16422.33
>> Requests/sec:  16540.24
>>
>> *2 cpus*
>> Requests/sec:  28538.35
>> Requests/sec:  26676.68
>> Requests/sec:  28100.00
>>
>> *4 cpus*
>> Requests/sec:  36448.57
>> Requests/sec:  33808.45
>> Requests/sec:  34383.20
>>
>>
>> *Java*
>> *1 cpu*
>> Requests/sec:  20191.95
>> Requests/sec:  21384.60
>> Requests/sec:  21705.82
>>
>> *2 cpus*
>> Requests/sec:  40876.17
>> Requests/sec:  40625.69
>> Requests/sec:  43766.45
>> *4 cpus*
>> Requests/sec:  46336.07
>> Requests/sec:  45933.35
>> Requests/sec:  45467.22
>>
>>
>> *Rust*
>> *1 cpu*
>> Requests/sec:  23604.27
>> Requests/sec:  23379.86
>> Requests/sec:  23477.19
>>
>> *2 cpus*
>> Requests/sec:  46973.84
>> Requests/sec:  46590.41
>> Requests/sec:  46128.15
>>
>> *4 cpus*
>> Requests/sec:  49491.98
>> Requests/sec:  50255.20
>> Requests/sec:  50183.11
>>
>> Linux on firecracker
>> *Golang*
>>
>> *1 CPU*
>> Requests/sec:  14498.02
>> Requests/sec:  14373.21
>> Requests/sec:  14213.61
>>
>> *2 CPUs*
>> Requests/sec:  28201.27
>> Requests/sec:  28600.92
>> Requests/sec:  28558.33
>>
>> *4 CPUs*
>> Requests/sec:  48983.83
>> Requests/sec:  47590.97
>> Requests/sec:  45758.82
>>
>> *Java*
>>
>> *1 CPU*
>> Requests/sec:  18217.58
>> Requests/sec:  17709.30
>> Requests/sec:  19829.01
>>
>> *2 CPUs*
>> Requests/sec:  33188.75
>> Requests/sec:  33233.55
>> Requests/sec:  36951.05
>>
>> *4 CPUs*
>> Requests/sec:  47718.13
>> Requests/sec:  46456.51
>> Requests/sec:  48408.99
>>
>> *Rust*
>> Could not get the same Rust app running on Alpine Linux, which uses musl
>>
>> Docker
>> *Golang*
>>
>> *1 CPU*
>> Requests/sec:  24568.70
>> Requests/sec:  24621.82
>> Requests/sec:  24451.52
>>
>> *2 CPUs*
>> Requests/sec:  49366.54
>> Requests/sec:  48510.87
>> Requests/sec:  43809.97
>>
>> *4 CPUs*
>> Requests/sec:  53613.09
>> Requests/sec:  53033.38
>> Requests/sec:  51422.59
>>
>> *Java*
>>
>> *1 CPU*
>> Requests/sec:  40078.52
>> Requests/sec:  43850.54
>> Requests/sec:  44588.22
>>
>> *2 CPUs*
>> Requests/sec:  48792.39
>> Requests/sec:  51170.05
>> Requests/sec:  52033.04
>>
>> *4 CPUs*
>> Requests/sec:  51409.24
>> Requests/sec:  52756.73
>> Requests/sec:  47126.19
>>
>> *Rust*
>>
>> *1 CPU*
>> Requests/sec:  40220.04
>> Requests/sec:  44601.38
>> Requests/sec:  44419.06
>>
>> *2 CPUs*
>> Requests/sec:  53420.56
>> Requests/sec:  53490.33
>> Requests/sec:  53320.99
>>
>> *4 CPUs*
>> Requests/sec:  53892.23
>> Requests/sec:  52814.93
>> Requests/sec:  54050.13
>>
>> Full example (Rust 4 CPUs - 
>> https://raw.githubusercontent.com/wkozaczuk/unikernels-v-containers/master/test_results/remote/docker/rust_docker_4_cpu.wrk
>> ):
>> [{"name":"Write 
>> presentation","completed":false,"due":"2019-03-23T15:30:40.579556117+00:00"},{"name":"Host
>>  
>> meetup","completed":false,"due":"2019-03-23T15:30:40.579599959+00:00"},{"name":"Run
>>  
>> tests","completed":false,"due":"2019-03-23T15:30:40.579600610+00:00"},{"name":"Stand
>>  
>> in 
>> traffic","completed":false,"due":"2019-03-23T15:30:40.579601081+00:00"},{"name":"Learn
>>  
>> Rust","completed":false,"due":"2019-03-23T15:30:40.579601548+00:00"}]-----------------------------------
>> Running 30s test @ http://192.168.1.73:8080/todos
>>   10 threads and 100 connections
>>   Thread Stats   Avg      Stdev     Max   +/- Stdev
>>     Latency     1.86ms    1.20ms  30.81ms   62.92%
>>     Req/Sec     5.42k   175.14     5.67k    87.71%
>>   1622198 requests in 30.10s, 841.55MB read
>> Requests/sec:  53892.23
>> Transfer/sec:     27.96MB
>> -----------------------------------
>> Running 30s test @ http://192.168.1.73:8080/todos
>>   10 threads and 100 connections
>>   Thread Stats   Avg      Stdev     Max   +/- Stdev
>>     Latency     1.90ms    1.19ms   8.98ms   58.18%
>>     Req/Sec     5.31k   324.18     5.66k    90.10%
>>   1589778 requests in 30.10s, 824.73MB read
>> Requests/sec:  52814.93
>> Transfer/sec:     27.40MB
>> -----------------------------------
>> Running 30s test @ http://192.168.1.73:8080/todos
>>   10 threads and 100 connections
>>   Thread Stats   Avg      Stdev     Max   +/- Stdev
>>     Latency     1.85ms    1.14ms   8.39ms   54.70%
>>     Req/Sec     5.44k   204.22     7.38k    92.12%
>>   1626902 requests in 30.10s, 843.99MB read
>> Requests/sec:  54050.13
>> Transfer/sec:     28.04MB
>>
>> I am also enclosing an example iperf run between the client and server 
>> machines to illustrate the raw network bandwidth available (BTW I tested 
>> against iperf running natively on the host and on OSv on qemu and 
>> firecracker, and got pretty much identical results, ~940 Mbits/sec - see 
>> https://github.com/wkozaczuk/unikernels-v-containers/tree/master/test_results/remote
>> ). 
>>
>> Connecting to host 192.168.1.102, port 5201
>> [  5] local 192.168.1.98 port 65179 connected to 192.168.1.102 port 5201
>> [ ID] Interval           Transfer     Bitrate
>> [  5]   0.00-1.00   sec   111 MBytes   930 Mbits/sec
>> [  5]   1.00-2.00   sec   111 MBytes   932 Mbits/sec
>> [  5]   2.00-3.00   sec   112 MBytes   938 Mbits/sec
>> [  5]   3.00-4.00   sec   112 MBytes   939 Mbits/sec
>> [  5]   4.00-5.00   sec   112 MBytes   940 Mbits/sec
>> [  5]   5.00-6.00   sec   111 MBytes   933 Mbits/sec
>> [  5]   6.00-7.00   sec   112 MBytes   940 Mbits/sec
>> [  5]   7.00-8.00   sec   112 MBytes   940 Mbits/sec
>> [  5]   8.00-9.00   sec   112 MBytes   941 Mbits/sec
>> [  5]   9.00-10.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  10.00-11.00  sec   112 MBytes   939 Mbits/sec
>> [  5]  11.00-12.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  12.00-13.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  13.00-14.00  sec   112 MBytes   942 Mbits/sec
>> [  5]  14.00-15.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  15.00-16.00  sec   111 MBytes   927 Mbits/sec
>> [  5]  16.00-17.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  17.00-18.00  sec   112 MBytes   942 Mbits/sec
>> [  5]  18.00-19.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  19.00-20.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  20.00-21.00  sec   112 MBytes   936 Mbits/sec
>> [  5]  21.00-22.00  sec   112 MBytes   940 Mbits/sec
>> [  5]  22.00-23.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  23.00-24.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  24.00-25.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  25.00-26.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  26.00-27.00  sec   112 MBytes   940 Mbits/sec
>> [  5]  27.00-28.00  sec   112 MBytes   941 Mbits/sec
>> [  5]  28.00-29.00  sec   112 MBytes   940 Mbits/sec
>> [  5]  29.00-30.00  sec   112 MBytes   941 Mbits/sec
>> - - - - - - - - - - - - - - - - - - - - - - - - -
>> [ ID] Interval           Transfer     Bitrate
>> [  5]   0.00-30.00  sec  3.28 GBytes   939 Mbits/sec                  
>> sender
>> [  5]   0.00-30.00  sec  3.28 GBytes   939 Mbits/sec                  
>> receiver
>>
>> iperf Done.
>>
>>
>> Observations/Conclusions
>>
>>    - OSv fares a little better on QEMU/KVM than on firecracker, and the 
>>    difference varies from ~5% to ~20% (Golang). Also please note the vast 
>>    difference between the 1-cpu test results on firecracker and QEMU 
>>    (hyperthreading is handled differently). On QEMU there is a small bump 
>>    from 1 to 2 to 4 cpus except for Golang; on firecracker there is an 
>>    almost ~90-100% bump from 1 to 2 cpus. 
>>       - To that end I have opened a firecracker issue - 
>>       https://github.com/firecracker-microvm/firecracker/issues/1034.
>>    - When you compare OSv on firecracker vs Linux on firecracker 
>>    (comparing against OSv on QEMU would, I guess, be unfair) you can see 
>>    that:
>>       - The Golang app on OSv was ~15% faster than on Linux with 1 cpu, 
>>       almost identical with 2 cpus, and ~30% slower than on Linux with 4 
>>       cpus (I did check that the Golang runtime properly detects the 
>>       number of cpus)
>>       - The Java app on OSv was ~5% faster with 1 cpu, ~20% faster with 2 
>>       cpus, and slightly slower with 4 cpus
>>       - I could not run the Rust app on Linux because it was an Alpine 
>>       distribution built with musl, and I did not have time to get Rust 
>>       to build properly for that scenario
>>    - When you compare OSv on QEMU/KVM vs Docker you can see that:
>>       - All apps running with a single cpu fare almost the same, with OSv 
>>       sometimes being a little faster
>>       - The Java and Rust apps performed only a little better (2-10%) on 
>>       Docker than on OSv
>>       - The Golang app on OSv scaled with the number of cpus but still 
>>       performed much worse (20-30%) than on Docker with 2 and 4 cpus
>>    - There seems to be a bottleneck around 40-50K requests per second 
>>    somewhere. Looking at one result, the raw network rate reported was 
>>    around 26-28MB per second. Given that HTTP requires sending both a 
>>    request and a response, possibly that is the maximum this network - the 
>>    combination of the Ethernet switch and the server and client machines - 
>>    can handle? (A quick sanity check follows below.)
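>>    A rough back-of-the-envelope check, using the Rust/Docker/4-cpu 
>>    numbers below:
>>       28.04 MB/s transfer / 54050 req/s ≈ 540 bytes of response per request
>>       540 B * 8 * 54050 req/s ≈ 230 Mbits/sec of response traffic
>>    Even adding request headers on top, that stays well below the ~940 
>>    Mbits/sec iperf reports, so if the network is the limit it looks more 
>>    likely to be per-packet cost than raw bandwidth.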
>>
>>
>> Questions
>>
>>    - Are there any flaws in this test setup?
>>    - Why does OSv not scale in some scenarios - especially when going 
>>    from 2 to 4 cpus? A networking bottleneck? The scheduler? Locks?
>>    - Could we further optimize OSv running with a single CPU (skip the 
>>    global cross-CPU page allocator, etc.)?
>>
>>
>> To get even more insight, I also compared how OSv on QEMU would fare 
>> against the same app running in Docker, with wrk running on the host and 
>> firing requests locally. You can find the results under 
>> https://github.com/wkozaczuk/unikernels-v-containers/tree/master/test_results/host
>> .
>>
>> OSv on QEMU
>> *Golang*
>>
>> *1 CPU*
>> Requests/sec:  25188.60
>> Requests/sec:  24664.43
>> Requests/sec:  23935.77
>> *2 CPUs*
>> Requests/sec:  37118.95
>> Requests/sec:  37108.96
>> Requests/sec:  35997.58
>>
>> *4 CPUs*
>> Requests/sec:  49987.20
>> Requests/sec:  48710.74
>> Requests/sec:  44789.96
>>
>>
>> *Java*
>> *1 CPU*
>> Requests/sec:  43648.02
>> Requests/sec:  45457.98
>> Requests/sec:  41818.13
>>
>> *2 CPUs*
>> Requests/sec:  76224.39
>> Requests/sec:  75734.63
>> Requests/sec:  70597.35
>>
>> *4 CPUs*
>> Requests/sec:  80543.30
>> Requests/sec:  75187.46
>> Requests/sec:  72986.93
>>
>>
>> *Rust*
>> *1 CPU*
>> Requests/sec:  42392.75
>> Requests/sec:  39679.21
>> Requests/sec:  37871.49
>>
>> *2 CPUs*
>> Requests/sec:  82484.67
>> Requests/sec:  83272.65
>> Requests/sec:  71671.13
>>
>> *4 CPUs*
>> Requests/sec:  95910.23
>> Requests/sec:  86811.76
>> Requests/sec:  83213.93
>>
>>
>> Docker
>>
>> *Golang*
>> *1 CPU*
>> Requests/sec:  24191.63
>> Requests/sec:  23574.89
>> Requests/sec:  23716.33
>>
>> *2 CPUs*
>> Requests/sec:  34889.01
>> Requests/sec:  34487.01
>> Requests/sec:  34468.03
>>
>> *4 CPUs*
>> Requests/sec:  48850.24
>> Requests/sec:  48690.09
>> Requests/sec:  48356.66
>>
>>
>> *Java*
>> *1 CPU*
>> Requests/sec:  32267.09
>> Requests/sec:  34670.41
>> Requests/sec:  34828.68
>>
>> *2 CPUs*
>> Requests/sec:  47533.94
>> Requests/sec:  50734.05
>> Requests/sec:  50203.98
>>
>> *4 CPUs*
>> Requests/sec:  69644.61
>> Requests/sec:  72704.40
>> Requests/sec:  70805.84
>>
>>
>> *Rust*
>> *1 CPU*
>> Requests/sec:  37061.52
>> Requests/sec:  36637.62
>> Requests/sec:  33154.57
>>
>> *2 CPUs*
>> Requests/sec:  51743.94
>> Requests/sec:  51476.78
>> Requests/sec:  50934.27
>>
>> *4 CPUs*
>> Requests/sec:  75125.41
>> Requests/sec:  74051.27
>> Requests/sec:  74434.78
>>
>>    - Does this test even make sense?
>>    - As you can see, OSv outperforms docker in this scenario to various 
>>    degrees, by 5-20%. Can anybody explain why? Is it because in this case 
>>    both wrk and the apps are on the same machine, and the number of 
>>    context switches between kernel and user mode is lower in OSv's favor? 
>>    Does it mean that we could benefit from a setup with a load balancer 
>>    (for example haproxy or squid) running on the same host in user mode 
>>    and forwarding to single-CPU OSv instances, vs a single OSv with 
>>    multiple CPUs? (A minimal sketch of what I mean follows below.) 
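>>
>>    Something like this minimal haproxy config is what I have in mind (the 
>>    backend addresses are made up - one single-CPU OSv guest each):
>>
>>       frontend todos_front
>>           bind *:8080
>>           default_backend osv_guests
>>
>>       backend osv_guests
>>           balance roundrobin
>>           server osv1 192.168.122.11:8080 check
>>           server osv2 192.168.122.12:8080 check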
>>
>> Looking forward to hearing what others think. 
>>
>> Waldek