Hi Kevin,
On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf <[email protected]> wrote:
> Am 05.08.2014 um 15:48 hat Stefan Hajnoczi geschrieben:
>> On Tue, Aug 05, 2014 at 06:00:22PM +0800, Ming Lei wrote:
>> > On Tue, Aug 5, 2014 at 5:48 PM, Kevin Wolf <[email protected]> wrote:
>> > > Am 05.08.2014 um 05:33 hat Ming Lei geschrieben:
>> > >> Hi,
>> > >>
>> > >> These patches bring up below 4 changes:
>> > >> - introduce object allocation pool and apply it to
>> > >> virtio-blk dataplane for improving its performance
>> > >>
>> > >> - introduce selective coroutine bypass mechanism
>> > >> for improving performance of virtio-blk dataplane with
>> > >> raw format image
>> > >
>> > > Before applying any bypassing patches, I think we should understand in
>> > > detail where we are losing performance with coroutines enabled.
>> >
>> > From the below profiling data, CPU becomes slow to run instructions
>> > with coroutine, and CPU dcache miss is increased so it is very
>> > likely caused by switching stack frequently.
>> >
>> > http://marc.info/?l=qemu-devel&m=140679721126306&w=2
>> >
>> > http://pastebin.com/ae0vnQ6V
>>
>> I have been wondering how to prove that the root cause is the ucontext
>> coroutine mechanism (stack switching). Here is an idea:
>>
>> Hack your "bypass" code path to run the request inside a coroutine.
>> That way you can compare "bypass without coroutine" against "bypass with
>> coroutine".
>>
>> Right now I think there are doubts because the bypass code path is
>> indeed a different (and not 100% correct) code path. So this approach
>> might prove that the coroutines are adding the overhead and not
>> something that you bypassed.
>
> My doubts aren't only that the overhead might not come from the
> coroutines, but also whether any coroutine-related overhead is really
> unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> just that instead of introducing additional code paths.
OK, thank you for taking a look at the problem. I hope we can
figure out the root cause. :-)
>
> Another thought I had was this: If the performance difference is indeed
> only coroutines, then that is completely inside the block layer and we
> don't actually need a VM to test it. We could instead have something
> like a simple qemu-img based benchmark and should be observing the same.
It is even simpler to run a coroutine-only benchmark, so I wrote
a rough one, and it looks like coroutines do decrease performance
a lot; please see the attached patch. Thanks for your template,
which helped me add the 'co_bench' command to qemu-img.
From the profiling data in the link below:
http://pastebin.com/YwH2uwbq
With coroutines, the running time for the same workload increases
by ~50% (1.325s vs. 0.903s), dcache load events increase by ~35%
(693M vs. 512M), and instructions per cycle decrease by ~17%
(1.35 vs. 1.63), compared with bypassing coroutines (the -b
parameter).
The bypass code in the benchmark is very similar to the approach
used in the bypass patch, since linux-aio with O_DIRECT seldom
blocks in the kernel I/O path.
Maybe the benchmark is a bit extreme, but given that modern storage
devices may reach millions of IOPS, it is very easy for coroutines
to slow down the I/O.
> I played a bit with the following, I hope it's not too naive. I couldn't
> see a difference with your patches, but at least one reason for this is
> probably that my laptop SSD isn't fast enough to make the CPU the
> bottleneck. Haven't tried ramdisk yet, that would probably be the next
> thing. (I actually wrote the patch up just for some profiling on my own,
> not for comparing throughput, but it should be usable for that as well.)
This might not be a good test, since it is basically a sequential
read test, which the kernel can optimize a lot. I always use a
randread benchmark.
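For example, the randread jobs I run with fio look roughly like
this (a sketch; the device path, queue depth, and runtime are
placeholders to adapt to the test setup):

```
[randread]
filename=/dev/vdb
rw=randread
bs=4k
ioengine=libaio
direct=1
iodepth=32
runtime=30
time_based
```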
Thanks,
diff --git a/Makefile b/Makefile
index d6b9dc1..a59523c 100644
--- a/Makefile
+++ b/Makefile
@@ -211,7 +211,7 @@ util/module.o-cflags = -D'CONFIG_BLOCK_MODULES=$(block-modules)'
qemu-img.o: qemu-img-cmds.h
-qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a
+qemu-img$(EXESUF): qemu-img.o $(block-obj-y) libqemuutil.a libqemustub.a -lcrypt
qemu-nbd$(EXESUF): qemu-nbd.o $(block-obj-y) libqemuutil.a libqemustub.a
qemu-io$(EXESUF): qemu-io.o $(block-obj-y) libqemuutil.a libqemustub.a
diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index ae64b3d..7601b9a 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -15,6 +15,12 @@ STEXI
@item bench [-q] [-f @var{fmt]} [-n] [-t @var{cache}] filename
ETEXI
+DEF("co_bench", co_bench,
+ "co_bench -c count -q -b")
+STEXI
+@item co_bench [-c] @var{count} [-b] [-q]
+ETEXI
+
DEF("check", img_check,
"check [-q] [-f fmt] [--output=ofmt] [-r [leaks | all]] filename")
STEXI
diff --git a/qemu-img.c b/qemu-img.c
index 92e9529..d73b171 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -366,6 +366,106 @@ static int add_old_style_options(const char *fmt, QemuOpts *opts,
return 0;
}
+struct crypt_data {
+ unsigned long sum;
+ bool bypass;
+};
+
+static unsigned long crypt_and_sum(const char *key, const char *salt)
+{
+    char *enc = crypt(key, salt);
+    unsigned long sum = 0;
+    int i, len;
+
+    if (!enc) {
+        return 0;
+    }
+
+    len = strlen(enc);
+    for (i = 0; i < len; i++) {
+        sum += enc[i];
+    }
+    return sum;
+}
+
+static void gen_key(char *key, int len)
+{
+ char set[] = {
+ '0', '1', '2', '3', '4', '5', '6', '7', '8',
+ 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i',
+ 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r',
+ 's', 't', 'w', 'x', 'y', 'z', '_', '-', ' ',
+ '*', '$', '#', '%',
+ };
+ int cnt = sizeof(set) / sizeof(set[0]);
+ int i;
+
+ for (i = 0; i < len; i++) {
+ key[i] = set[rand() % cnt];
+ }
+}
+
+static void crypt_bench(void *opaque)
+{
+ struct crypt_data *data = opaque;
+    const int len = 8;
+    char key1[9] = { 0 };   /* extra byte keeps the key NUL-terminated */
+    char key2[9] = { 0 };
+    char salt[3];
+
+    gen_key(key1, len);
+    gen_key(key2, len);
+    salt[0] = key1[0];
+    salt[1] = key2[7];
+    salt[2] = '\0';         /* crypt() expects a NUL-terminated salt */
+
+ data->sum += crypt_and_sum(key1, salt);
+ data->sum += crypt_and_sum(key2, salt);
+
+ if (!data->bypass) {
+ qemu_coroutine_yield();
+ }
+}
+
+static int co_bench(int argc, char **argv)
+{
+ int c;
+ bool bypass = false;
+ unsigned long cnt = 1;
+ int num = 1;
+ unsigned long i;
+ struct crypt_data data = {
+ .sum = 0,
+ .bypass = bypass,
+ };
+
+    for (;;) {
+        c = getopt(argc, argv, "bc:q");
+        if (c == -1) {
+            break;
+        }
+        switch (c) {
+        case 'b':
+            bypass = true;
+            break;
+        case 'c':
+            num = atoi(optarg);
+            break;
+        }
+    }
+ }
+
+ data.bypass = bypass;
+
+    srand((unsigned int)(uintptr_t)&i);  /* seed from a stack address */
+    srand(rand());
+ for (i = 0; i < num * cnt; i++) {
+ Coroutine *co;
+ if (!data.bypass) {
+ co = qemu_coroutine_create(crypt_bench);
+ qemu_coroutine_enter(co, &data);
+ } else {
+ crypt_bench(&data);
+ }
+ }
+ return (int)data.sum;
+}
+
static int img_create(int argc, char **argv)
{
int c;