ni...@lysator.liu.se (Niels Möller) writes: > Ideally, I'd like to have something like the below: > > uint64_t l0, l1, l2, h0, h1, h1; > > l0 = p0 & 0xfffffffffff; h0 = p0 >> 44; > l1 = p1 & 0xfffffffffff; h1 = p1 >> 44; > l2 = p2 & 0x3ffffffffff; h2 = p2 >> 42; > > ctx->hh0 = l0 + 5*h2; > ctx->hh1 = l1 + h0; > ctx->hh2 = l2 + h1; > > But likely won't work, since bound on p0, p1, p2 is 3 * 2^88 plus any > excess bits we had on the input state, and therefore number of excess > bits will grow with each iteration (but we could possibly do this for a few > iterations before reducing further).
Let me try to quantify this. Say the input state are numbers < a 2^44. Then we get p0, p1, p2 < 3a 2^88 Next, h0, h1 < 3a 2^44 h2 < 3a 2^46 The largest output is then the new hh0, hh0 < 2^44 + 5 (3a 2^46) = (60 a + 1) 2^44 So the state bound grows by almost 6 bits (could probably be sligthly less with a more careful analysis). So say we start with a = 2 (state numbers < 2^45), then after three iterations they have grown to 63 bits, and we'd have to do a round of additional folding to get them back to < 2^45. Might be used instead of, or in addition to, doing multiple blocks using additional precomputed powers of the key, there are many options. E.g., precompute one k^2, do two blocks at a time (each pk is then the accumulation of 6 products rather than three, so one bit larger than above). Do the half way reduction as above, then another two blocks, and the more expensive reduction back to 45 bits. So a loop doing four blocks per iteration using only one additional precomputed key power. With more careful analysis, could perhaps be extended to 6 blocks. Looking at dependent operations, I estimate that the simple reduction gives a four cycles dependency. To hide some of that, I think the computation of the p0, p1, p2 should be ordered so that h0 is used last of the inputs, and p2 is ready first (i.e., first compute and accumulate the products involving h1, h2 inputs, then compute the h0 product that is needed to complete into p2, then the other two products involving h0. Minimum latency from the dependencies of a single-block loop would then be 5-6 cycles plus the latency for a full multiplication. It's rather complex, and to get close to full utilizaton of multipliers, I imagine the latency for the state word with the cheapest reduction matters too. Regards, /Niels -- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance. _______________________________________________ nettle-bugs mailing list -- nettle-bugs@lists.lysator.liu.se To unsubscribe send an email to nettle-bugs-le...@lists.lysator.liu.se