On Tue, 30 Jan 2024 at 12:04, John Naylor <johncnaylo...@gmail.com> wrote:
>
> On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma <ants.aa...@cybertec.at> wrote:
> > But given that we know the data length and we have it in a register
> > already, it's easy enough to just mask out data past the end with a
> > shift. See patch 1. Performance benefit is about 1.5x Measured on a
> > small test harness that just hashes and finalizes an array of strings,
> > with a data dependency between consecutive hashes (next address
> > depends on the previous hash output).
>
> Interesting work! I've taken this idea and (I'm guessing, haven't
> tested) improved it by re-using an intermediate step for the
> conditional, simplifying the creation of the mask, and moving the
> bitscan out of the longest dependency chain. Since you didn't attach
> the test harness, would you like to run this and see how it fares?
> (v16-0001 is same as your 0001, and v16-0002 builds upon it.) I plan
> to test myself as well, but since your test tries to model true
> latency, I'm more interested in that one.
It didn't calculate the same result because the if (mask) condition was
incorrect. I changed it to if (chunk & 0xFF) and removed the right shift
from the mask. It seems to be half a nanosecond faster, but as I don't
have a machine set up for microbenchmarking it's quite close to
measurement noise.

I didn't post the harness as it's currently so messy as to be near
useless to others. But if you'd like to play around, I can tidy it up a
bit and post it.

> > Not sure if the second one is worth the extra code.
>
> I'd say it's not worth optimizing the case we think won't be taken
> anyway. I also like having a simple path to assert against.

Agreed.

As an addendum, I couldn't resist trying out 256-bit vectors with two
parallel AES hash streams, using unaligned loads and special-casing loads
that straddle a page boundary. It requires -march=x86-64-v3 -maes. It is
about 20% faster than fasthash on short strings and 2.2x faster on 4k
strings. Right now it requires 4-byte alignment (it uses vpmaskmovd), but
could be made to work with any alignment.

Regards,
Ants Aasma
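For anyone following along without the patches, the word-at-a-time tail
masking under discussion above is roughly the following. This is only an
illustrative sketch of the general idea on a little-endian 64-bit word,
not the v16 patch code; the function name is made up here.

#include <stdint.h>

/*
 * Sketch only: zero out the terminator byte and everything after it in a
 * little-endian 64-bit chunk, so trailing garbage does not affect the hash.
 */
static inline uint64_t
mask_past_terminator(uint64_t chunk)
{
    /* high bit set in each byte that is zero (the usual haszero trick);
     * bytes above the first zero may false-positive, but we only need
     * the lowest one */
    uint64_t zeroes = (chunk - UINT64_C(0x0101010101010101)) &
        ~chunk & UINT64_C(0x8080808080808080);

    if (zeroes)
    {
        /* lowest set bit is bit 7 of the first NUL byte */
        int bit = __builtin_ctzll(zeroes);

        /* keep only the bytes strictly before the NUL */
        chunk &= (UINT64_C(1) << (bit - 7)) - 1;
    }
    return chunk;
}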
#include <immintrin.h>
#include <inttypes.h>

#define PAGE_SIZE 0x1000

uint64_t
fast_vec_hash_cstring_avx2(char *buf)
{
    __m128i hash0 = {0, 0};
    __m128i hash1 = {0, 0};
    __m128i k0 = {0x0807060504030201, 0x100F0E0D0C0B0A09};
    __m128i k1 = {0x1117161514131211, 0x201F1E1D1C1B1A19};
    char *cur = buf;
    int mask;
    __m256i chunk;
    int offset = (uintptr_t) buf & (sizeof(chunk) - 1);
    int endpos;

    do
    {
        char *end_of_page = (char *) ((((uintptr_t) cur) | (PAGE_SIZE - 1)) + 1);

        /* Process full 32-byte chunks that do not cross the page boundary. */
        for (; cur + sizeof(chunk) <= end_of_page; cur += sizeof(chunk))
        {
            chunk = _mm256_loadu_si256((__m256i *) cur);
            __m256i ends = _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
            mask = _mm256_movemask_epi8(ends);

            if (mask)
                goto last_iteration;

            hash0 = _mm_aesenc_si128(hash0, k0);
            hash1 = _mm_aesenc_si128(hash1, k1);
            hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
            hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));
        }

        if (offset)
        {
            /*
             * The next 32-byte chunk straddles the page boundary. First load
             * only the bytes still on this page (vpmaskmovd, hence the 4-byte
             * alignment requirement) and look for the terminator there.
             */
            int remaining = (int) (end_of_page - cur);
            __m256i load_mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(remaining / 4),
                                                   _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7));
            chunk = _mm256_maskload_epi32((const int *) cur, load_mask);
            __m256i ends = load_mask & _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
            mask = _mm256_movemask_epi8(ends);

            if (mask)
                goto last_iteration;

            /*
             * No terminator before the boundary, so the string continues onto
             * the next page and the rest of the chunk is safe to load.
             */
            chunk |= _mm256_maskload_epi32((const int *) cur, ~load_mask);
            ends = _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
            mask = _mm256_movemask_epi8(ends);

            if (mask)
                goto last_iteration;

            hash0 = _mm_aesenc_si128(hash0, k0);
            hash1 = _mm_aesenc_si128(hash1, k1);
            hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
            hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));

            cur += sizeof(chunk);
        }
    } while (1);

last_iteration:
    /* chunk contains the final data, mask contains the location of the terminator */
    endpos = _tzcnt_u32(mask);

    /* zero out the terminator and everything after it */
    chunk &= _mm256_cmpgt_epi8(_mm256_set1_epi8(endpos),
                               _mm256_setr_epi8(0, 1, 2, 3, 4, 5, 6, 7,
                                                8, 9, 10, 11, 12, 13, 14, 15,
                                                16, 17, 18, 19, 20, 21, 22, 23,
                                                24, 25, 26, 27, 28, 29, 30, 31));

    hash0 = _mm_aesenc_si128(hash0, k0);
    hash1 = _mm_aesenc_si128(hash1, k1);
    hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
    hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));

    /* a few extra rounds to finalize, then fold down to 64 bits */
    hash0 = _mm_aesenc_si128(hash0, k0);
    hash1 = _mm_aesenc_si128(hash1, k1);
    hash0 = _mm_aesenc_si128(hash0, k1);
    hash1 = _mm_aesenc_si128(hash1, k0);
    hash0 = _mm_aesenc_si128(hash0, k0);
    hash1 = _mm_aesenc_si128(hash1, k1);

    __m128i intermediate = hash1 ^ hash0;
    return intermediate[1] ^ intermediate[0];
}
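To show the latency-style measurement mentioned above, here is a rough
sketch of the idea (this is not the actual harness; names, string
contents and sizes are arbitrary): the string hashed in the next
iteration is picked based on the previous hash output, so the loop
measures hashing latency rather than throughput. Compile together with
the function above, e.g. gcc -O2 -march=x86-64-v3 -maes.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NSTRINGS 4096
#define ITERATIONS 10000000

extern uint64_t fast_vec_hash_cstring_avx2(char *buf);

/* keep each string 32-byte aligned so only the simple path is exercised */
static char strings[NSTRINGS][64] __attribute__((aligned(64)));

int
main(void)
{
    uint64_t hash = 0;
    struct timespec start, end;

    for (int i = 0; i < NSTRINGS; i++)
        snprintf(strings[i], sizeof(strings[i]), "some_key_prefix_%d", i);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < ITERATIONS; i++)
    {
        /* data dependency: the next address depends on the previous hash */
        hash = fast_vec_hash_cstring_avx2(strings[hash % NSTRINGS]);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.2f ns per hash (%016" PRIx64 ")\n", ns / ITERATIONS, hash);

    return 0;
}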