On Tue, 30 Jan 2024 at 12:04, John Naylor <johncnaylo...@gmail.com> wrote:
>
> On Tue, Jan 30, 2024 at 4:13 AM Ants Aasma <ants.aa...@cybertec.at> wrote:
> > But given that we know the data length and we have it in a register
> > already, it's easy enough to just mask out data past the end with a
> > shift. See patch 1. Performance benefit is about 1.5x, measured on a
> > small test harness that just hashes and finalizes an array of strings,
> > with a data dependency between consecutive hashes (next address
> > depends on the previous hash output).
>
> Interesting work! I've taken this idea and (I'm guessing, haven't
> tested) improved it by re-using an intermediate step for the
> conditional, simplifying the creation of the mask, and moving the
> bitscan out of the longest dependency chain. Since you didn't attach
> the test harness, would you like to run this and see how it fares?
> (v16-0001 is same as your 0001, and v16-0002 builds upon it.) I plan
> to test myself as well, but since your test tries to model true
> latency, I'm more interested in that one.

It didn't calculate the same result because the if (mask) condition
was incorrect. I changed it to if (chunk & 0xFF) and removed the right
shift from the mask. It seems to be half a nanosecond faster, but as I
don't have a machine set up for microbenchmarking it's quite close to
the measurement noise.
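
To keep the discussion concrete, here is a rough sketch of the
word-at-a-time masking idea (haszero64, hash_step and the other names
are placeholders, not the actual patch; it assumes a little-endian
machine and 8-byte aligned input so the trailing load cannot cross a
page boundary):

#include <stdint.h>
#include <string.h>

// Nonzero iff some byte of v is 0x00; the lowest set 0x80 bit marks
// the first zero byte.
static inline uint64_t
haszero64(uint64_t v)
{
    return (v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL;
}

// Stand-in for one fasthash accumulation round.
static inline uint64_t
hash_step(uint64_t h, uint64_t chunk)
{
    h ^= chunk;
    return h * 0x880355f21e6d1965ULL;
}

uint64_t
hash_cstring_wordwise(const char *str)
{
    uint64_t h = 0;

    for (;;)
    {
        uint64_t chunk;
        uint64_t zeroes;

        // Reads up to 7 bytes past the NUL; with 8-byte aligned input
        // the load stays within one page, so it cannot fault.
        memcpy(&chunk, str, sizeof(chunk));
        zeroes = haszero64(chunk);
        if (zeroes)
        {
            // Mask out the terminator and everything after it with a
            // shift, then do the last accumulation.
            int nul_byte = __builtin_ctzll(zeroes) / 8;

            if (nul_byte > 0)
                h = hash_step(h, chunk & (~0ULL >> (64 - 8 * nul_byte)));
            return h;
        }
        h = hash_step(h, chunk);
        str += sizeof(chunk);
    }
}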

I didn't post the harness as it's currently too messy to be of much
use to others. But if you'd like to play around with it, I can tidy it
up a bit and post it.

> > Not sure if the second one is worth the extra code.
>
> I'd say it's not worth optimizing the case we think won't be taken
> anyway. I also like having a simple path to assert against.

Agreed.

As an addendum, I couldn't resist trying out 256-bit vectors with two
parallel AES hash lanes, using unaligned loads and special-casing loads
that would straddle a page boundary. It requires -march=x86-64-v3 -maes.
It's about 20% faster than fasthash on short strings and 2.2x faster on
4k strings. Right now it requires 4-byte alignment (it uses vpmaskmovd),
but it could be made to work with any alignment.

Regards,
Ants Aasma
#include <immintrin.h>
#include <inttypes.h>

#define PAGE_SIZE 0x1000

uint64_t
fast_vec_hash_cstring_avx2(char *buf)
{
    __m128i hash0 = {0, 0};
    __m128i hash1 = {0, 0};

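    // Arbitrary, distinct round-key constants for the two AES lanes.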
    __m128i k0 = {0x0807060504030201, 0x100F0E0D0C0B0A09};
    __m128i k1 = {0x1117161514131211, 0x201F1E1D1C1B1A19};

    char *cur = buf;

    int mask;
    __m256i chunk;
    int offset = (uintptr_t) buf & (sizeof(chunk) - 1);
    int endpos;

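    // Process the string in 32-byte chunks. Full chunks that stay within
    // the current page are read with plain unaligned loads; a chunk that
    // would straddle the page boundary is read in two masked halves so
    // that we never touch a page the string does not reach.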
    do {
        char *end_of_page = (char*) ((((uintptr_t) cur) | (PAGE_SIZE-1)) + 1);
        for (; cur + sizeof(chunk) <= end_of_page; cur += sizeof(chunk))
        {
            chunk = _mm256_loadu_si256((__m256i*) cur);
            __m256i ends = _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
            mask = _mm256_movemask_epi8(ends);
            if (mask)
                goto last_iteration;
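            // One AES round with the lane key, then one with the chunk data.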
            hash0 = _mm_aesenc_si128(hash0, k0);
            hash1 = _mm_aesenc_si128(hash1, k1);
            hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
            hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));
        }
        if (offset)
        {
            // The next 32-byte load would cross a page boundary: only
            // 32 - offset bytes are left on this page, so load just those
            // dwords with a masked load and check them for the terminator.
            __m256i load_mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(8 - offset / 4), _mm256_setr_epi32(0,1,2,3,4,5,6,7));
            chunk = _mm256_maskload_epi32((const int*) cur, load_mask);
            __m256i ends = load_mask & _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
            mask = _mm256_movemask_epi8(ends);
            if (mask)
                goto last_iteration;
            // No terminator before the boundary, so the string continues
            // onto the next page and the remaining dwords are safe to load.
            chunk |= _mm256_maskload_epi32((const int*) cur, ~load_mask);
            ends = _mm256_cmpeq_epi8(chunk, _mm256_set1_epi8(0));
            mask = _mm256_movemask_epi8(ends);
            if (mask)
                goto last_iteration;
            hash0 = _mm_aesenc_si128(hash0, k0);
            hash1 = _mm_aesenc_si128(hash1, k1);
            hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
            hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));
            cur += sizeof(chunk);
        }
    } while(1);


last_iteration:
    // chunk contains the final data, mask marks the terminating NUL byte
    endpos = _tzcnt_u32(mask);
    // Zero out the terminator and everything after it before the final rounds.
    chunk &= _mm256_cmpgt_epi8(_mm256_set1_epi8(endpos), _mm256_setr_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31));
    hash0 = _mm_aesenc_si128(hash0, k0);
    hash1 = _mm_aesenc_si128(hash1, k1);
    hash0 = _mm_aesenc_si128(hash0, _mm256_extracti128_si256(chunk, 0));
    hash1 = _mm_aesenc_si128(hash1, _mm256_extracti128_si256(chunk, 1));
    
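    // Finalization: extra AES rounds to mix the state before combining the lanes.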
    hash0 = _mm_aesenc_si128(hash0, k0);
    hash1 = _mm_aesenc_si128(hash1, k1);
    hash0 = _mm_aesenc_si128(hash0, k1);
    hash1 = _mm_aesenc_si128(hash1, k0);
    hash0 = _mm_aesenc_si128(hash0, k0);
    hash1 = _mm_aesenc_si128(hash1, k1);

    __m128i intermediate = hash1 ^ hash0;
    return intermediate[1] ^ intermediate[0];
}
