the data now comes out right up until it's consolidated at the end of the softmax.

i stepped through it carefully, and it turns out the attention values
are being generated in a truncated manner.  there are only 20 in the
efficient_attention code, whereas there are 96 in the working code.

so, i still got something wrong. i'm guessing my test passed more
easily because it used the same feature size for all of queries, keys,
and values. that is not true in the perceiver_loader test i'm
pursuing; it looks as if the values have a feature size of 20
whereas the keys have a feature size of 96. i guess i need to review
again to get it producing 96 attention scores instead of 20. unsure.
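a minimal NumPy sketch of why a test with equal feature sizes can hide this kind of mix-up. it is not the efficient_attention code. the sizes 256 and 96 come from the notes below, and treating 96 as the number of keys and 20 as the value feature size is an assumption; qk_features = 16 is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1, 8, 256, 96 come from the notes; reading 96 as the key count is an assumption
batch, heads, num_q, num_kv = 1, 8, 256, 96
qk_features = 16   # hypothetical query/key feature size
v_features = 20    # assumed value feature size

queries = rng.standard_normal((batch, heads, num_q, qk_features))
keys = rng.standard_normal((batch, heads, num_kv, qk_features))
values = rng.standard_normal((batch, heads, num_kv, v_features))

# one score per (query, key) pair; the shared feature axis d is contracted away
scores = np.einsum('bhqd,bhkd->bhqk', queries, keys)

# numerically stable softmax over the key axis
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# contract over the key axis; only the value feature axis f survives
out = np.einsum('bhqk,bhkf->bhqf', weights, values)

print(scores.shape)  # (1, 8, 256, 96)
print(out.shape)     # (1, 8, 256, 20)
```

if every size were the same, an einsum string that contracts or keeps the wrong axis would still produce the expected shape, which would explain a test passing too easily.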

here are the notes with the einsum letters fleshed out, unchecked:

      chunked:
                queries: ...qhd
                keys: ...khd
                values: ...vhd
                scores: ...qhk
                mask: ...hqk -> ...qhk
      unchunked:
                queries: .hqd
                keys: .hkd
                values: .hvd
                scores: .hqk -> 1,8,256,96
                mask: .hqk -> 1,1,1,96 -> needs extension to num_heads, num_queries
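a rough sketch of the mask handling in the notes above: extend a 1,1,1,96 mask to num_heads and num_queries, then reorder hqk -> qhk for the chunked layout. this assumes a boolean mask where True means "keep"; the real code may use an additive mask instead. the padding pattern is invented for illustration.

```python
import numpy as np

batch, heads, num_q, num_k = 1, 8, 256, 96

# mask in the unchunked .hqk layout, with size-1 head and query axes
mask = np.ones((batch, 1, 1, num_k), dtype=bool)
mask[..., -10:] = False  # hypothetical: pretend the last 10 keys are padding

# extend to num_heads and num_queries (broadcast_to returns a read-only view)
mask_hqk = np.broadcast_to(mask, (batch, heads, num_q, num_k))

# reorder ...hqk -> ...qhk to line up with the chunked scores
mask_qhk = np.swapaxes(mask_hqk, -3, -2)

# stand-in scores in the chunked ...qhk layout
scores_qhk = np.zeros((batch, num_q, heads, num_k))

# masked-out positions get the most negative float so softmax sends them to ~0
masked = np.where(mask_qhk, scores_qhk, np.finfo(scores_qhk.dtype).min)

print(mask_qhk.shape)  # (1, 256, 8, 96)
```

relying on broadcasting inside np.where would also work without the explicit broadcast_to, but making the extension explicit makes shape mismatches easier to spot.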


commit eb16dc63d2c617bfe708881f8bd5ba96be8b9f50 (HEAD -> memory-efficient-attention, xloem/memory-efficient-attention)
Author: xloem <[email protected]>
Date:   Thu Jan 27 11:43:45 2022 +0000

    wip efficient attention: dimensions pass but data is truncated
