Sharper Attention

Self-attention enables transformer networks to track relationships between 
distant tokens — such as text characters — in long sequences, but the 
computational resources required grow quadratically with input size. New work 
aims to streamline the process by rating each token’s relevance to the task at 
hand.


*What’s new:* Sainbayar Sukhbaatar and colleagues at Facebook proposed 
Expire-Span 
<https://info.deeplearning.ai/e2t/tc/VWHTc31YWHF5W4x3LDZ8z9Qq1VmLt1V4w7-QFN4gTlKX3q0BBV1-WJV7CgFDJW7YyLV02QS2W8VnGbJ53HZflPN5B4_W2nLFQXN5H67q9-S5JHW91L3KS3k3qBWW2wJJm72tZkX8W3rf-4L6TW0JXW93XZ_y738vjKW5BWsnH4Gm3JJW139yH03yFLtrN5nYPFVt04PsW1h5Xb75pl7NtN8NWrVfFdHSSD_7fTJHKBjW35w8nN7gtcxpN8YtmCf_LwJvN6Y0GjSjPyn0N2DhQ_VV1xcLW53r8y09h-X9mW23BnRm5rMJL1W12tmnw2mBbTYW6NyQ985xz7XQW7tGFg98JHJ7fW2T2Z-q5f1K9wW1JMh5h7K1fxgW2mDbFm7XyWmVW8R0q6n6GxH-6W7_HBqF7kHlmJ3hVL1>,
 which enables attention to ignore tokens that are no longer relevant.


*Key insight:* Depending on the task, some tokens affect a model’s performance 
more than others. For instance, in predicting the sentiment of the sentence 
“Then she cried,” the word “cried” matters more than “then.” By forgetting less 
relevant tokens, attention can process longer sequences with less computation.


*How it works:* The authors modified a transformer’s attention layers. They 
trained the model in typical fashion to predict the next character in a 
sequence using the enwik8 
<https://info.deeplearning.ai/e2t/tc/VWHTc31YWHF5W4x3LDZ8z9Qq1VmLt1V4w7-QFN4gTlK13q0zJV1-WJV7CgS-fW7nJgLv4n6MzfW6PvG_W8nWpPfW8WybKP3-WN4YW7s42tQ7nXx9-W77LyrT6tB0GQW8-CrJ_8Rm839W5P9xnq9l_PJ-W179Md-8LhyKhVwNgWw4wtdqjW3y3QcN2XgWsTW7c8YpS8cH5mhW6WlCmm48bTb9VxhXBs7q6TLhW5f5Shl407JgVN8DML3Rd7qnBW3dvMDF6vHdDjW8LHVYP5RN78pW79nVt47C2tTBW7y4M4k97wNN1W8yG_zK6X6dGsW1HS93q3R9ChGW3hhLyL1jfhfw34z81>
 dataset of text from *English Wikipedia*. Given the first token, it predicted 
the next. Then, using the first two tokens, it predicted the next, and so on.
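This next-character objective can be sketched in a few lines of Python (an illustration of the training setup described above, not Facebook's actual code):

```python
# Character-level language modeling: from a sequence, build
# (prefix, next-character) prediction pairs, as described above.
def next_char_pairs(text):
    return [(text[:i], text[i]) for i in range(1, len(text))]

pairs = next_char_pairs("abcd")
# First pair: given "a", predict "b"; last pair: given "abc", predict "d".
```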


 * To each attention layer, the authors added a vanilla neural network that 
predicted the number of times that attention should use each token. It assigned 
a value to each new token, subtracted 1 after each prediction, and deleted the 
token when the value reached 0. 
 * The loss function penalized the total time the model retained each token, 
keeping it from assigning arbitrarily high values (otherwise, it could predict 
that every token should be retained until the whole sequence had been 
processed). In this way, the model learned to keep only the tokens most useful 
to an accurate prediction.
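
The expiration mechanism can be illustrated with a toy sketch in pure Python. The hand-written span table below stands in for the small span-predicting network; none of this is the authors' actual code:

```python
# Toy sketch of the Expire-Span idea. Each new token receives a predicted
# "span": how many steps it stays available to attention. After every step
# the remaining spans are decremented, and tokens that reach zero expire.
def expire_span_memory(tokens, predict_span):
    """Return the surviving memory contents (token list) after each step."""
    memory = {}   # token index -> remaining span
    history = []
    for i, tok in enumerate(tokens):
        memory[i] = predict_span(tok)  # new token enters with its span
        # attention at this step would look only at tokens still in memory
        history.append([tokens[j] for j in sorted(memory)])
        # decrement spans; expire tokens whose span hits zero
        memory = {j: s - 1 for j, s in memory.items() if s - 1 > 0}
    return history

# Pretend "then" is unimportant (span 1) while the other words matter (span 3).
spans = {"then": 1, "she": 3, "cried": 3}
hist = expire_span_memory(["then", "she", "cried"], lambda t: spans[t])
# "then" expires immediately, so later steps attend only to "she" and "cried".
```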

*Results:* The authors evaluated Expire-Span based on total memory usage, 
training time per batch, and bits per byte (a measure of how well the model 
predicted the next token; lower is better). On enwik8, it achieved 1.03 bits 
per byte, while Adaptive-Span 
<https://info.deeplearning.ai/e2t/tc/VWHTc31YWHF5W4x3LDZ8z9Qq1VmLt1V4w7-QFN4gTlK13q0zJV1-WJV7CgM6WV9jChs643RpyW54PW1h2QmVRPW24Y46G63PPlPW3HZYvL12p_5GW3Z4x6m3gPxM5W1fqnK_7VXVwmW25ZHFr5NS4K3W6wtMcK6pWz2sM-Nb5NYR0ngW7glVVS6G9Tm8W4t51NG66dvQyN28-92j7gd2SW7fSfXL8hVzcBW9lvBTW6KnmJ-W6rBFMF6yfgG9W8jNpBy7yvzPCW782xrB1sNYghW5G8ZS834x2cSW2JzZpN97HRKRW4_dJ_W4g4lpPW50zyb471lFcCW5Hcst_6tyFGm3p081>
 achieved 1.04 bits per byte and compressive transformer 
<https://info.deeplearning.ai/e2t/tc/VWHTc31YWHF5W4x3LDZ8z9Qq1VmLt1V4w7-QFN4gTlK13q0zJV1-WJV7CgS91W2z2HCS172S5LW7CLxRv7ZMVcWW6VQkR26g0LS6W2Qfdbf297N1yW90VT5V78Mwm4W3Sbvp938W0DDW7ns_cL30y_-jW3SV3WL502tkTN5DcWllmCmYVW6NZwhQ5Y9l2yW66lDtt1t8wnJW78X_5M843HFSW3Z0dFv24hmj_W6g6h7n5NTj3GVZH4cL5qQj0KW8CNJpv2W3clJN91wxdmbTgnpW6XxXBp2HfskyW9dBPt050MMvCW4x1MZH4xVtXxVvsy_f1tKfL6VY6R387DcFtL3pMY1>
 achieved 1.05 bits per byte. The authors’ model used 25 percent less GPU 
memory than the other two approaches (15GB versus 20GB and 21GB respectively). 
It also took less time to train (408ms per batch of 512 tokens compared to 
483ms and 838ms). 
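
For reference, bits per byte is simply the average negative log-probability (base 2) that the model assigns to each correct next byte. A minimal sketch:

```python
import math

# Bits per byte: average number of bits needed to encode each byte
# under the model's predicted distribution. Lower is better.
def bits_per_byte(correct_probs):
    """correct_probs: probability the model gave each actual next byte."""
    return sum(-math.log2(p) for p in correct_probs) / len(correct_probs)

# A model that assigns probability 0.5 to every correct byte scores
# exactly 1.0 bit per byte; roughly 0.49 per byte gives about 1.03.
```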


*Why it matters:* Forgetting the least relevant information enables 
transformers to process longer sequences in less time and memory. 


*We’re thinking:* Q: What do you do if a transformer forgets too much? A: Give 
it an Optimus Primer.



------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T080c5c944cacedd7-M99ff857547c5e6c58b703d2e