On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson <r...@twiddle.net> wrote: > On 06/30/2016 06:45 AM, Peter Maydell wrote: >> >> On 29 June 2016 at 09:47, <vija...@cavium.com> wrote: >>> >>> From: Vijay <vija...@cavium.com> >>> >>> Use Neon instructions to perform zero checking of >>> buffer. This is helps in reducing total migration time. >> >> >>> diff --git a/util/cutils.c b/util/cutils.c >>> index 5830a68..4779403 100644 >>> --- a/util/cutils.c >>> +++ b/util/cutils.c >>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd) >>> #define SPLAT(p) _mm_set1_epi8(*(p)) >>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == >>> 0xFFFF) >>> #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2)) >>> +#elif __aarch64__ >>> +#include "arm_neon.h" >>> +#define VECTYPE uint64x2_t >>> +#define ALL_EQ(v1, v2) \ >>> + ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \ >>> + (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1))) >>> +#define VEC_OR(v1, v2) ((v1) | (v2)) >> >> >> Should be '#elif defined(__aarch64__)'. I have made this >> tweak and put this patch in target-arm.next. > > > Consider > > #define VECTYPE uint32x4_t > #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0) > > > which compiles down to > > 1c: 6e211c00 eor v0.16b, v0.16b, v1.16b > 20: 6eb0a800 umaxv s0, v0.4s > 24: 1e260000 fmov w0, s0 > 28: 6b1f001f cmp w0, wzr > 2c: 1a9f17e0 cset w0, eq > 30: d65f03c0 ret
For me this code compiles as below and migration time is ~100ms more. See below 3 trails of migration time 7039cc: 6eb0a800 umaxv s0, v0.4s 7039d0: 0e043c02 mov w2, v0.s[0] 7039d4: 350000c2 cbnz w2, 7039ec <f0+0xf4> 7039d8: 91002084 add x4, x4, #0x8 7039dc: 91020063 add x3, x3, #0x80 7039e0: eb01009f cmp x4, x1 (qemu) info migrate capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off Migration status: completed total time: 3070 milliseconds downtime: 55 milliseconds setup: 4 milliseconds transferred ram: 300637 kbytes throughput: 802.49 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2062834 pages skipped: 0 pages normal: 70489 pages normal bytes: 281956 kbytes dirty sync count: 3 (qemu) info migrate capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off Migration status: completed total time: 3067 milliseconds downtime: 47 milliseconds setup: 5 milliseconds transferred ram: 290277 kbytes throughput: 775.61 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2064185 pages skipped: 0 pages normal: 67901 pages normal bytes: 271604 kbytes dirty sync count: 3 (qemu) (qemu) info migrate capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off Migration status: completed total time: 3067 milliseconds downtime: 34 milliseconds setup: 5 milliseconds transferred ram: 294614 kbytes throughput: 787.19 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2063365 pages skipped: 0 pages normal: 68985 pages normal bytes: 275940 kbytes dirty sync count: 3 > > vs > > 34: 4e083c20 mov x0, v1.d[0] > 38: 4e083c01 mov x1, v0.d[0] > 3c: eb00003f cmp x1, x0 > 40: 52800000 mov w0, #0 > 44: 54000040 b.eq 4c <f0+0x18> > 48: d65f03c0 ret > 4c: 4e183c20 mov x0, v1.d[1] > 50: 4e183c01 mov x1, v0.d[1] > 54: eb00003f cmp x1, x0 > 58: 1a9f17e0 cset w0, eq > 5c: d65f03c0 ret > My patch compiles to below code and takes ~100ms less time #define VECTYPE uint64x2_t #define ALL_EQ(v1, v2) \ ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \ (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1))) 7039d0: 4e083c02 mov x2, v0.d[0] 7039d4: b5000102 cbnz x2, 7039f4 <f0+0xfc> 7039d8: 4e183c02 mov x2, v0.d[1] 7039dc: b50000c2 cbnz x2, 7039f4 <f0+0xfc> 7039e0: 91002084 add x4, x4, #0x8 7039e4: 91020063 add x3, x3, #0x80 7039e8: eb04003f cmp x1, x4 capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off Migration status: completed total time: 2973 milliseconds downtime: 67 milliseconds setup: 5 milliseconds transferred ram: 293659 kbytes throughput: 809.45 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2062791 pages skipped: 0 pages normal: 68748 pages normal bytes: 274992 kbytes dirty sync count: 3 (qemu) capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off Migration status: completed total time: 2972 milliseconds downtime: 47 milliseconds setup: 5 milliseconds transferred ram: 295972 kbytes throughput: 816.10 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2062861 pages skipped: 0 pages normal: 69325 pages normal bytes: 277300 kbytes dirty sync count: 3 (qemu) capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off Migration status: completed total time: 2982 milliseconds downtime: 40 milliseconds setup: 5 milliseconds transferred ram: 293386 kbytes throughput: 806.26 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2063199 pages skipped: 0 pages normal: 68679 pages normal bytes: 274716 kbytes dirty sync count: 4 (qemu) Regards Vijay