On Sun, Jun 20, 2021 at 7:46 PM Richard W.M. Jones <[email protected]> wrote:
>
> As Nir has often pointed out, our current default request buffer size
> (32MB) is too large, resulting in nbdcopy being as much as 2½ times
> slower than it could be.
>
> The optimum buffer size most likely depends on the hardware, and may
> even vary over time as machines get generally larger caches. To
> explore the problem I used this command:
>
> $ hyperfine -P rs 15 25 'nbdkit -U - sparse-random size=100G seed=1 --run
> "nbdcopy --request-size=\$((2**{rs})) \$uri \$uri"'
This uses the same process for serving both reads and writes, which may
differ from real-world usage, where one process serves the reads and a
separate process serves the writes.

> On my 2019-era AMD server with 32GB of RAM and 64MB * 4 of L3 cache,
> 2**18 (262144) was the optimum when I tested all sizes between
> 2**15 (32K) and 2**25 (32M, the current default).
>
> Summary
> 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy
> --request-size=\$((2**18)) \$uri \$uri"' ran
> 1.03 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
> 1.06 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"'
> 1.09 ± 0.03 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'

The difference is very small up to this point.

> 1.23 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
> 1.26 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
> 1.39 ± 0.04 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
> 1.45 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"'
> 1.61 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
> 1.94 ± 0.05 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
> 2.47 ± 0.08 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'
>
> On my 2018-era Intel laptop with a measly 8 MB of L3 cache, the optimum
> size is one power-of-2 smaller (but 2**18 is still an improvement):
>
> Summary
> 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy
> --request-size=\$((2**17)) \$uri \$uri"' ran

This matches the results I got when testing the libev example on a
Lenovo T480s (~2018) and a Dell OptiPlex 9080 (~2012).
> 1.05 ± 0.19 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**15)) \$uri \$uri"'
> 1.06 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
> 1.10 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"'
> 1.22 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
> 1.29 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
> 1.33 ± 0.02 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
> 1.35 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
> 1.38 ± 0.01 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
> 1.45 ± 0.02 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
> 1.63 ± 0.03 times faster than 'nbdkit -U - sparse-random size=100G seed=1
> --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'
>
> To get an idea of the best request size on something rather different,
> this is a Raspberry Pi 4B. I had to reduce the copy size down by a
> factor of 10 (to 10G) to make it run in a reasonable time. 2**18 is
> about 8% slower than the optimum choice (2**15). It's still
> significantly better than our current default.
>
> Summary
> 'nbdkit -U - sparse-random size=10G seed=1 --run "nbdcopy
> --request-size=\$((2**15)) \$uri \$uri"' ran
> 1.00 ± 0.04 times faster than 'nbdkit -U - sparse-random size=10G seed=1
> --run "nbdcopy --request-size=\$((2**21)) \$uri \$uri"'
> 1.03 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1
> --run "nbdcopy --request-size=\$((2**20)) \$uri \$uri"'
> 1.04 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1
> --run "nbdcopy --request-size=\$((2**22)) \$uri \$uri"'
> 1.05 ± 0.08 times faster than 'nbdkit -U - sparse-random size=10G seed=1
> --run "nbdcopy --request-size=\$((2**16)) \$uri \$uri"'
> 1.05 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1
> --run "nbdcopy --request-size=\$((2**19)) \$uri \$uri"'
> 1.07 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1
> --run "nbdcopy --request-size=\$((2**17)) \$uri \$uri"'
> 1.08 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1
> --run "nbdcopy --request-size=\$((2**18)) \$uri \$uri"'
> 1.15 ± 0.05 times faster than 'nbdkit -U - sparse-random size=10G seed=1
> --run "nbdcopy --request-size=\$((2**23)) \$uri \$uri"'
> 1.28 ± 0.06 times faster than 'nbdkit -U - sparse-random size=10G seed=1
> --run "nbdcopy --request-size=\$((2**24)) \$uri \$uri"'
> 1.35 ± 0.06 times faster than 'nbdkit -U - sparse-random size=10G seed=1
> --run "nbdcopy --request-size=\$((2**25)) \$uri \$uri"'

But all these results do not test a real-world copy. They test copying
from memory to memory with zero (practical) latency. When I tested using
real storage on a real server, I got the best results using 16 requests
on a single connection with a request size of 1 MiB. 4 connections with
4 requests per connection and the same request size seemed to be ~10%
faster under these conditions.
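Something along these lines could reproduce that comparison (a sketch
only: the image paths are hypothetical, and --connections assumes an
nbdcopy build with multi-connection support; with older builds only
--requests can be varied):

  # Serve a local test image over NBD, then compare 16 requests on one
  # connection against 4 requests on each of 4 connections, both using
  # 1 MiB requests.
  $ nbdkit -U - file /var/tmp/src.img --run \
      'hyperfine \
        "nbdcopy --request-size=\$((2**20)) --requests=16 --connections=1 \$uri /var/tmp/dst.img" \
        "nbdcopy --request-size=\$((2**20)) --requests=4 --connections=4 \$uri /var/tmp/dst.img"'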
I posted more info on these tests here:
https://listman.redhat.com/archives/libguestfs/2021-May/msg00124.html

Of course, testing with other servers or storage can show different
results, and it is impossible to find a single value that works best in
all cases. I think we need to test both the number of requests and the
number of connections to improve the defaults; a possible sweep is
sketched after the quoted patch below.

> ---
>  copy/main.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/copy/main.c b/copy/main.c
> index 0fddfc3..70534b5 100644
> --- a/copy/main.c
> +++ b/copy/main.c
> @@ -50,7 +50,7 @@ bool flush;              /* --flush flag */
>  unsigned max_requests = 64;     /* --requests */
>  bool progress;                  /* -p flag */
>  int progress_fd = -1;           /* --progress=FD */
> -unsigned request_size = MAX_REQUEST_SIZE;  /* --request-size */
> +unsigned request_size = 1<<18;  /* --request-size */

But this is clearly a better default.

>  unsigned sparse_size = 4096;    /* --sparse */
>  bool synchronous;               /* --synchronous flag */
>  unsigned threads;               /* --threads */
> --
> 2.32.0

Nir
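The sweep mentioned above could look something like the following,
reusing the sparse-random harness (again a sketch: --connections
assumes an nbdcopy with multi-connection support):

  $ hyperfine -L r 1,4,16,64 -L c 1,2,4 'nbdkit -U - sparse-random size=100G seed=1 --run "nbdcopy --request-size=\$((2**18)) --requests={r} --connections={c} \$uri \$uri"'

hyperfine's -L (--parameter-list) runs every combination, so this
benchmarks all 12 request/connection pairs and ranks them, like the
-P sweep at the top of the thread.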
