On 02/23/2018 08:29 AM, Peter Maydell wrote:
> On 17 February 2018 at 18:22, Richard Henderson
> <richard.hender...@linaro.org> wrote:
>> Signed-off-by: Richard Henderson <richard.hender...@linaro.org>
>> ---
>
>> diff --git a/target/arm/sve_helper.c b/target/arm/sve_helper.c
>> index 86cd792cdf..ae433861f8 100644
>> --- a/target/arm/sve_helper.c
>> +++ b/target/arm/sve_helper.c
>> @@ -46,14 +46,14 @@
>>   *
>>   * The return value has bit 31 set if N is set, bit 1 set if Z is clear,
>>   * and bit 0 set if C is set.
>> - *
>> - * This is an iterative function, called for each Pd and Pg word
>> - * moving forward.
>>   */
>>
>>  /* For no G bits set, NZCV = C.  */
>>  #define PREDTEST_INIT  1
>>
>> +/* This is an iterative function, called for each Pd and Pg word
>> + * moving forward.
>> + */
>
> Why move this comment?

Meant to fold this to the first.  But moving so that I can separately
document...

>> +/* This is an iterative function, called for each Pd and Pg word
>> + * moving backward.
>> + */
>> +static uint32_t iter_predtest_bwd(uint64_t d, uint64_t g, uint32_t flags)

... this.

>> +    do {                                                        \
>> +        uint64_t out = 0, pg;                                   \
>> +        do {                                                    \
>> +            i -= sizeof(TYPE), out <<= sizeof(TYPE);            \
>> +            TYPE nn = *(TYPE *)(vn + H(i));                     \
>> +            TYPE mm = *(TYPE *)(vm + H(i));                     \
>> +            out |= nn OP mm;                                    \
>> +        } while (i & 63);                                       \
>> +        pg = *(uint64_t *)(vg + (i >> 3)) & MASK;               \
>> +        out &= pg;                                              \
>> +        *(uint64_t *)(vd + (i >> 3)) = out;                     \
>> +        flags = iter_predtest_bwd(out, pg, flags);              \
>> +    } while (i > 0);                                            \
>> +    return flags;                                               \
>> +}
>
> Why do we iterate backwards through the vector? As far as I can
> see the pseudocode iterates forwards, and I don't think it
> makes a difference to the result which way we go.

You're right, it does not make a difference to the result which way we
iterate.

Of the several different ways I've written loops over predicates, this is
my favorite.  It has several points in its favor:

1) Operate on full uint64_t predicate units instead of uint8_t
   or uint16_t sub-units.  This means

   1a) No big-endian adjustment required,
   1b) Fewer memory loads.

2) No separate loop tail; it is shared with the main loop body.

3) A sub-point specific to predicate output, but the main loop
   gets to run un-predicated.  Here the governing predicate is
   applied at the end: out &= pg.


r~
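
For illustration only, here is a standalone sketch of the same loop shape
outside the macro. It is not the code from the patch: the name cmpeq_h_bwd,
the fixed 16-bit element type, and the direct array indexing (which sidesteps
the H() host-endian adjustment) are all assumptions made for the example, and
the flag accumulation via iter_predtest_bwd is left out. It assumes the vector
length is a nonzero multiple of 16 bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not the QEMU function: compare 16-bit elements
 * for equality and produce one predicate bit per vector byte, using
 * bit 0 of each 2-byte group.  The vector is walked from the top down.
 * opr_sz is the vector length in bytes (assumed: nonzero, multiple of 16).
 */
static void cmpeq_h_bwd(uint64_t *pd, const uint16_t *vn, const uint16_t *vm,
                        const uint64_t *pg, size_t opr_sz)
{
    /* Predicate bits that are meaningful for 2-byte elements. */
    const uint64_t mask = 0x5555555555555555ull;
    size_t i = opr_sz;

    do {
        uint64_t out = 0;

        /* Un-predicated inner loop: build up to 64 predicate bits.
         * A partial top word (opr_sz not a multiple of 64) ends at the
         * same test, so there is no separate tail loop.
         */
        do {
            i -= sizeof(uint16_t);
            out <<= sizeof(uint16_t);
            out |= (vn[i / 2] == vm[i / 2]);
        } while (i & 63);

        /* Apply the governing predicate once per 64-bit word. */
        out &= pg[i / 64] & mask;
        pd[i / 64] = out;
    } while (i > 0);
}
```

With an 80-byte vector, the first pass of the inner loop covers only the top
16 bytes and stores into pd[1]; the second pass covers the full low 64 bytes
and stores into pd[0]. Both passes load one whole uint64_t of governing
predicate.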