i watched the perceiver run for a bit.  i tried reducing the
learning rate, which was way too high.
i'm not sure why the predicted characters all trend toward the same
values (even at the start).  reviewing the data flow might help, but
it could take several passes through different parts of the model to
find the issue.  experience with transformer models would likely help.

my indication that something is wired up wrong is that the values
trend toward 0 early on.  if the attention mask were being respected
there shouldn't be a pull toward 0, because most of the 0s (padding)
are masked out.
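one quick way to check whether the mask is actually respected is a
mask-invariance test: scramble the keys/values at padded positions and
verify the attention output doesn't change.  this is a minimal sketch
with a toy scaled dot-product attention in numpy, not the actual
perceiver code, and the function/variable names here are made up:

```python
import numpy as np

def masked_attention(q, k, v, mask):
    # scaled dot-product attention; mask[j] == 0 marks position j as padding
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # padded positions get a large negative score so softmax gives them ~0 weight
    scores = np.where(mask[None, :].astype(bool), scores, -1e9)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
T, D = 6, 4
q = rng.normal(size=(T, D))
k = rng.normal(size=(T, D))
v = rng.normal(size=(T, D))
mask = np.array([1, 1, 1, 0, 0, 0])  # last three positions are padding

out1 = masked_attention(q, k, v, mask)

# scramble the padded keys/values; a correctly wired mask makes this a no-op
k2, v2 = k.copy(), v.copy()
k2[mask == 0] = 999.0
v2[mask == 0] = 999.0
out2 = masked_attention(q, k2, v2, mask)

print(np.allclose(out1, out2))  # True if the mask is respected
```

if the real model fails the analogous check (outputs move when you
perturb padded inputs), the mask isn't reaching the attention scores
and the padding 0s would pull activations toward 0, which matches the
symptom above.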
