these are my incomplete notes on the axis permutations used inside the model's attention implementations. each axis is labeled with an einsum letter.
chunked:
queries: ...qhd
keys: ...khd
values: ...vhd
mask: ...hqk -> ...qhk
scores: ...qhk
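
A rough sketch of how the chunked layouts above could fit together, using NumPy. The function name `chunked_attention`, the 1/sqrt(d) scaling, the softmax over `k`, the boolean mask convention (True = attend), and the assumption that `v` indexes the same positions as `k` are all mine and are not stated in the notes; only the axis orders come from the list above.

```python
import numpy as np

def chunked_attention(queries, keys, values, mask):
    """Hypothetical helper illustrating the chunked layouts.

    queries: ...qhd   keys: ...khd   values: ...vhd (v assumed equal to k)
    mask:    ...hqk   boolean, True where attention is allowed (assumption)
    returns: ...qhd
    """
    d = queries.shape[-1]
    # scores: ...qhk (scaling by sqrt(d) is an assumption, not from the notes)
    scores = np.einsum("...qhd,...khd->...qhk", queries, keys) / np.sqrt(d)
    # mask: ...hqk -> ...qhk, i.e. swap the h and q axes
    mask_qhk = np.swapaxes(mask, -3, -2)
    # use a large finite negative so fully masked rows do not produce NaN
    scores = np.where(mask_qhk, scores, np.finfo(scores.dtype).min)
    # softmax over the k axis (last axis of ...qhk)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # contract k against the values' position axis, back to ...qhd
    return np.einsum("...qhk,...khd->...qhd", weights, values)
```

With a leading batch axis of 2, 3 queries, 4 keys, 2 heads and d = 5, the inputs would have shapes (2, 3, 2, 5), (2, 4, 2, 5), (2, 4, 2, 5) and a mask of shape (2, 2, 3, 4), and the result comes back as (2, 3, 2, 5).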
unchunked:
queries: .hqd
keys: .hkd
values: .hvd
mask: .
scores: .
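
For the three unchunked entries that are filled in, the only difference from the chunked layouts is that the head axis moves in front of the position axis. A minimal sketch of that permutation, assuming the leading `.` stands for whatever leading axes there are (I write it as `...` here) and using hypothetical helper names `to_unchunked` / `to_chunked`. The unchunked mask and scores layouts are not filled in above, so they are left out here too.

```python
import numpy as np

def to_unchunked(x):
    """...qhd -> ...hqd (same for keys ...khd and values ...vhd)."""
    # einsum with no contraction is just an axis permutation
    return np.einsum("...qhd->...hqd", x)

def to_chunked(x):
    """...hqd -> ...qhd, the inverse permutation."""
    return np.einsum("...hqd->...qhd", x)
```

Both helpers only swap the two middle axes, so applying one after the other gives back the original array.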
