So if word2vec uses, say, 1000 dimensions, then each word has 1000 coordinates, and with a 2000-word vocabulary, how else would you store the structure? Can you explain? I see a visualization here: https://hazyresearch.github.io/hyperE/ but I don't see how a node can be in 45 different places at the same time...
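For concreteness, here is a minimal Python sketch of how word2vec embeddings are typically stored. The numbers (2000 words, 1000 dimensions) are just the ones from the question; the tiny vocabulary and the helper names (vector_for, cosine) are hypothetical, not any library's API:

import numpy as np

vocab_size, dim = 2000, 1000

# The whole "structure" is just a dense lookup table: one row per word.
# Relationships between words are not stored explicitly anywhere; they
# are implicit in the geometry (distances, directions) of the rows.
embeddings = np.random.randn(vocab_size, dim).astype(np.float32)  # random here, trained values in practice

word_to_index = {"king": 0, "queen": 1, "man": 2, "woman": 3}  # hypothetical entries

def vector_for(word: str) -> np.ndarray:
    """Return a word's 1000 coordinates by indexing its row."""
    return embeddings[word_to_index[word]]

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Similarity is computed from coordinates, not read from a stored graph."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(vector_for("king"), vector_for("queen")))

So nothing beyond the matrix (plus the word-to-row mapping) needs to be stored; everything else is computed from the coordinates on demand.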
If GPT-2 predicts the next token from left-side context only, and BERT uses bidirectional context, does that make BERT twice as good? Is there more to why you would want to use BERT, or is that the only reason?
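To make the left-only vs. bidirectional contrast concrete, here is a toy numpy sketch of the two attention-mask patterns. This is an assumed illustration of the masking idea, not either model's actual code:

import numpy as np

seq_len = 5  # a 5-token sequence, for illustration

# GPT-2 style causal mask: position i may only attend to positions <= i,
# so each next-token prediction sees left context only.
causal_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))

# BERT style bidirectional mask: every position attends to every other,
# so a masked-token prediction sees both left and right context.
bidirectional_mask = np.ones((seq_len, seq_len), dtype=bool)

print(causal_mask.astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [1 1 1 0 0]
#  [1 1 1 1 0]
#  [1 1 1 1 1]]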
