An interesting thing I seem to have learnt is that self-attention in the Transformer architecture is actually a way of building a tree horizontally. Each word checks every other word, which must be the IF-condition. Then the values for those matched options are summed, and the next layer proceeds to do it again.
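To make that concrete, here is a minimal single-head self-attention sketch in NumPy (my own illustration, not anyone's actual implementation). The "each word checks every other word" step is the Q·K score matrix, the soft IF-condition is the softmax over those scores, and the "sum values for those options" step is the weighted sum with V. The 1024-token / 1600-dimension figures mentioned below are GPT-2 XL style numbers and are assumptions; the sketch uses tiny stand-in sizes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Every token scores every other token ("each word checks each word"):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax(scores, axis=-1)   # soft IF-condition over all pairs
    return weights @ V                   # weighted sum of the chosen values

rng = np.random.default_rng(0)
T, d = 8, 16                  # 8 tokens, 16 dims (stand-ins for 1024 x 1600)
X = rng.normal(size=(T, d))
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)              # one new vector per token, fed to the next layer
```

Stacking this layer-on-layer is what gives the "do it again at the next layer" effect: each pass mixes every token with every other token, so depth compounds pairwise checks rather than branching out like a tree.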
> Transformers are not like traditional ANNs, they don't do hierarchical
> activations, they use self attention.
> They don't cross connect upwards, they process in 1 layer so to speak.

Now I'm thinking for later: could this be done without backprop? Assuming backprop is a worse idea. But it doesn't explode like a tree; did they cap it to 1024 items with 1600 dimensions each? So what does that allow?

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T6cf3be509c7cd2f2-M40678c4f4d548e846e6e7f36
Delivery options: https://agi.topicbox.com/groups/agi/subscription
