It is almost done. Please help me make it fully complete.
This is how to code GPT-2. So far it looks easy peasy, but the training of the network etc. needs help and more detail.
Once this is done, I can move on to coding it and updating "How GPT-2's systems work and what it all really means in the end".
Also attached is a photo: does GPT-2 do this? How?
[image: 2019-06-11.png]
------------------------------------------
Use the WebText dataset: gather ~40 GB of text from Reddit outbound links with karma of 3 or higher.
Create a Byte Pair Encoding vocabulary, assign an integer ID to each token, give each one a one-hot encoding, and then, based on relational similarity, learn an embedding vector for each token.
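As a sanity check on the one-hot idea, here is a rough numpy sketch (toy sizes and random numbers, not the real 50,257-token BPE vocab) showing that multiplying a one-hot row by the embedding matrix is the same as just looking up that token's row:

    import numpy as np

    # Toy sizes; GPT-2 would use 50257 tokens and 1024 dimensions here.
    vocab_size, d_model = 10, 8
    rng = np.random.default_rng(0)
    E = rng.normal(size=(vocab_size, d_model))   # stand-in for the learned embedding table

    token_id = 3
    one_hot = np.zeros(vocab_size)
    one_hot[token_id] = 1.0

    embed_via_matmul = one_hot @ E               # the one-hot picks out one row
    embed_via_lookup = E[token_id]               # what real code does instead
    assert np.allclose(embed_via_matmul, embed_via_lookup)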
Training samples a window randomly over the e.g. 800 GB, because it can't train on all of it, but by spreading out its reading it gets more diversity out of it, so it really only reads e.g. 200 GB. Why not slide the window over every 4th offset then, so it doesn't overtrain on one area? It masks the last token and tries to predict it. It uses autograd ADAM backpropagation with \beta_1 = 0.9 and \beta_2 = 0.98. They used a batch size of 512. They use Perplexity, adjust the weights, and measure the Loss, which can be visualized on a plot. They used a learning rate schedule where they slowly warmed up the learning rate, then decreased it according to the following formula: LRate = d_{model}^{-0.5} \cdot \min(step\_num^{-0.5}, step\_num \cdot warmup\_steps^{-1.5}). To penalize the model when it becomes too confident in its predictions, the authors performed label smoothing.
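Here is a rough Python sketch of just that learning rate formula (the warmup value is an example, not GPT-2's actual setting):

    def lrate(step_num, d_model=1024, warmup_steps=4000):
        # LRate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)
        step_num = max(step_num, 1)   # avoid dividing by zero at step 0
        return d_model ** -0.5 * min(step_num ** -0.5,
                                     step_num * warmup_steps ** -1.5)

    # Rises linearly during warmup, then decays like 1/sqrt(step).
    for step in (1, 1000, 4000, 10000, 100000):
        print(step, lrate(step))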
Take e.g. 2 input words:
['the', 'girlis']
Get their tokens (break 'girlis' down using Byte Pair Encoding vocab parts; the vocab is 50,257 tokens):
[<S>, 'the', 'girl', 'is']
Get their integer IDs and their positional IDs. Types of integer IDs: empty padding = 0 for the rest of the 1024 positions if we only have 4 tokens in the window; start token = 1 if we start with no context; e.g. 'the' = 2, 'boat' = 456:
[1, 3, 129, 39, 0, 0, 0, ..., 0]   (token IDs, padded out to 1024)
[1, 2, 3, 4, 0, 0, 0, ..., 0]      (positional IDs, padded out to 1024)
Get the 1024-dimensional coordinates/embedding vector (each float is 16 bits long) for each input ID and for each positional ID:
[[0.42, 0.67, 0.12, ...], [0.42, 0.67, 0.12, ...], [0.42, 0.67, 0.12, ...], [0.42, 0.67, 0.12, ...], ...]
[[0.23, 0.12, 0.76, ...], [0.23, 0.12, 0.76, ...], [0.23, 0.12, 0.76, ...], [0.23, 0.12, 0.76, ...], ...]
For each item above, add it to the item directly above it, e.g. 0.42 + 0.23 = 0.65 for the first item.
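A rough numpy sketch of this lookup-and-add step (random numbers stand in for the trained tables; note this counts positions from 0 while the lists above count from 1):

    import numpy as np

    vocab_size, max_positions, d_model = 50257, 1024, 1024
    rng = np.random.default_rng(0)
    wte = rng.normal(size=(vocab_size, d_model)).astype(np.float16)      # token embedding table (16-bit floats)
    wpe = rng.normal(size=(max_positions, d_model)).astype(np.float16)   # positional embedding table

    token_ids = np.array([1, 3, 129, 39])      # the 4-token example above, without the padding
    position_ids = np.arange(len(token_ids))   # 0, 1, 2, 3

    x = wte[token_ids] + wpe[position_ids]     # add each positional vector onto its token vector
    print(x.shape)                             # (4, 1024): one summed vector per token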
Then you take a vector and find its mean: add up all 1024 coordinates and divide by 1024 (in this example it comes out to about -0.00332).
Then you take each item in the vector and subtract the mean from it, e.g. for one coordinate: 0.030731 - (-0.003320143824218751) = 0.0340511438242187.
Then you take the standard deviation, 0.20575964639782154, and add epsilon to it (not related to the brain, it only keeps the math from blowing up), 0.00001, to get 0.2057696463978215, and you divide the previous result 0.0340511438242187 by 0.2057696463978215 to get 0.1654818600328761.
We then multiply it by the weight 0.0584 dedicated to that coordinate (stamping) and add a bias 0.0036237 to get 0.01328784062592 (stamping).
https://www.mathsisfun.com/data/standard-deviation.html
Layer Norm takes each coordinate in a word vector, subtracts the vector's mean, and divides by the standard deviation, so each coordinate becomes just its small true deviation (how unexpected it is) compared to the other coordinates. Then it multiplies it by a weight and adds a bias. The vector holds the word embedding data plus the positional data. This is just to normalize the vectors so they are comparable; the scaling also keeps the numbers small to work with.
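A rough numpy sketch of Layer Norm as described above (the weight/bias values are made up; the real ones are trained, one per coordinate):

    import numpy as np

    def layer_norm(x, weight, bias, eps=1e-5):
        mean = x.mean(axis=-1, keepdims=True)   # mean of each 1024-long vector
        std = x.std(axis=-1, keepdims=True)     # standard deviation of each vector
        normed = (x - mean) / (std + eps)       # epsilon just keeps the division safe
        return normed * weight + bias           # "stamp" with the weight, then the bias

    d_model = 1024
    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, d_model))           # 5 word vectors stacked like pancakes
    weight = np.full(d_model, 0.0584)           # toy values, e.g. the 0.0584 above
    bias = np.full(d_model, 0.0036237)
    print(layer_norm(x, weight, bias).shape)    # still (5, 1024)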
Up to now, the embeddings in 5.7 Layer1_x_layernorm1.txt are 5 word vectors stacked like pancakes, each 1024 coordinates long, so each column is 5 long. 8.4 layer1_W_QKV.weight.txt (a trained weight matrix) has 1024 weight vectors stacked, each with 3072 items. We flip (transpose) the vectors in 5.7 Layer1_x_layernorm1.txt so the rectangle of vectors is now tall like a tower instead, but we don't do this for 8.4 layer1_W_QKV.weight.txt. We then have each flipped word vector's items (let's focus on just one word vector here) multiply against their dedicated items in 8.4 layer1_W_QKV.weight.txt to do matrix multiplication, and we sum the results up. Then we do this again for the rest of the 3072 columns (3071 left) of the weight matrix. The pancake moves to the right through all 1024-tall columns, 3072 steps in total.
Each of the 5 stacked word vectors gets a 3072-item-long vector, and we add to each item a bias from 8.5 layer1_W_QKV.bias.txt, like stamping it with ink.
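A rough numpy sketch of this whole step, using the file names from the document but random numbers in place of the trained weights:

    import numpy as np

    rng = np.random.default_rng(0)
    x_ln1 = rng.normal(size=(5, 1024))      # 5.7 Layer1_x_layernorm1.txt: 5 word vectors
    W_qkv = rng.normal(size=(1024, 3072))   # 8.4 layer1_W_QKV.weight.txt
    b_qkv = rng.normal(size=(3072,))        # 8.5 layer1_W_QKV.bias.txt

    qkv = x_ln1 @ W_qkv + b_qkv             # matrix multiply, then stamp the bias onto every row
    print(qkv.shape)                        # (5, 3072): one 3072-long vector per word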
We then split each of these 3072-long vectors into three 1024-long pieces (3 x 1024 = 3072) to make the Query, Key, and Value files. Each file has 5 'word' vectors stacked in it, each 1024 items long; it's as if we've cloned it 3 times.
We then split each file into per-head files, so a file, e.g. the Query file, now has 16 sections, one for each head, and each of the 16 sections has 5 vectors that each have 64 item features. Each of the 5 words has its own query vector. The same is done 'redundantly' for the other 15 heads and for the K and V files. The file for K is twisted a bit: it has 16 head sections and each section is 64 rows stacked, each 5 long, one slot per word. The V file isn't twisted.
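A rough numpy sketch of the splitting (and the 'twist' on K, which is just a transpose so the next step's shapes line up); values are random stand-ins:

    import numpy as np

    n_heads, d_head = 16, 64
    qkv = np.random.default_rng(0).normal(size=(5, 3072))     # stand-in for the step above

    q, k, v = np.split(qkv, 3, axis=-1)                       # three (5, 1024) blocks: Query, Key, Value
    q = q.reshape(5, n_heads, d_head).transpose(1, 0, 2)      # (16, 5, 64): 16 heads, 5 words, 64 features
    k = k.reshape(5, n_heads, d_head).transpose(1, 2, 0)      # (16, 64, 5): the "twisted" Key file
    v = v.reshape(5, n_heads, d_head).transpose(1, 0, 2)      # (16, 5, 64): Value is not twisted
    print(q.shape, k.shape, v.shape)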
12.4 Layer1_x_attention_scores.txt has 16 head sections, each with 5 vectors stacked that are each 5 features long. We get this by multiplying each of the 5 stacked vectors (64 features long) in the Q file against each of the 5 64-long vectors in the K file and then summing them up. So for a given word's query we get 5 results in a new vector; 5 such vectors stacked, 16 head sections.
We then scale each: divide each item by the square root of 64 (i.e. divide by 8). After the scaled Q*K we add a bias stamp.
We then Mask each head, so for a given head the 5 rows each 5 scores long look
like:
[[-1.4111e+00, -1.0000e+04, -1.0000e+04, -1.0000e+04, -1.0000e+04],
[-9.3259e-01, 4.7022e-01, -1.0000e+04, -1.0000e+04, -1.0000e+04],
[-1.2644e+00, -5.7796e-01, -8.5491e-01, -1.0000e+04, -1.0000e+04],
[-1.4166e+00, -8.0913e-01, -4.4999e-01, -6.0653e-03, -1.0000e+04],
[-1.5199e+00, -7.7955e-01, -6.0395e-01, -1.1510e-01, -1.2940e+00]]
Then softmax each row so it adds up to 1 and all the numbers are positive:
[[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[1.9737e-01, 8.0263e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[2.2258e-01, 4.4220e-01, 3.3523e-01, 0.0000e+00, 0.0000e+00],
[1.0457e-01, 1.9197e-01, 2.7492e-01, 4.2854e-01, 0.0000e+00],
[9.1546e-02, 1.9193e-01, 2.2878e-01, 3.7300e-01, 1.1474e-01]]
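A rough numpy sketch of the scaling, masking, and row-wise softmax shown above (random scores standing in for the real Q*K results; -1e4 is the same "very negative" fill value as in the dump):

    import numpy as np

    scores = np.random.default_rng(0).normal(size=(16, 5, 5)) / np.sqrt(64)   # scaled Q*K stand-in

    mask = np.tril(np.ones((5, 5), dtype=bool))   # lower triangle: each word sees itself and earlier words
    masked = np.where(mask, scores, -1e4)         # future positions get -10000, like above

    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    attn = e / e.sum(axis=-1, keepdims=True)      # each row is now positive and sums to 1
    print(attn[0].round(4))                       # one head's 5x5 attention weights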
Now take the split-heads Value file: each head is 5 pancakes stacked, each 64 features long, and each pancake has a matching row in the softmaxed file. For each of the 5 items in, e.g., the first softmaxed row, we multiply it by every item of its dedicated Value vector FOR that word (it * items 1, 2, 3, ..., 64) and sum the weighted vectors, so the first row above now has 64 items. We can see here that the Value file gives some identical results for 1-word self-attention. Values appear to shrink frequent English words. The attention scoring is done all-to-all per word (64 against 64) and then masked so a word is only looking at previous words; hence this gives us, e.g. for word_3, three scores, which multiply against their Values, and then we combine them to get the actual score for row 3 (word 3) attending to previous words.
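A rough numpy sketch of this weighting step: per head, the 5x5 softmaxed weights times the 5x64 Value block gives each word a 64-item mix of its own and earlier words' Values (random stand-in values here):

    import numpy as np

    rng = np.random.default_rng(0)
    attn = rng.random(size=(16, 5, 5))
    attn /= attn.sum(axis=-1, keepdims=True)   # pretend these are the softmaxed rows above
    v = rng.normal(size=(16, 5, 64))           # the split-heads Value file

    context = attn @ v                         # (16, 5, 64): 64 weighted items per word, per head
    print(context.shape)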
Each head's first row is then combined with the other heads' first rows to get 1024 items again for the first word, and likewise for the other 4 words; this is the merged heads file. Some diagrams show the steps from head splitting up to merging as a block done 12 times for GPT-2, but that block is just done once for each head; it is not 12 layers, and it isn't adding anything like the next word.
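A rough numpy sketch of the merge: the (16, 5, 64) per-head results get put back side by side so each of the 5 words is one 1024-long vector again (16 heads * 64 features = 1024):

    import numpy as np

    context = np.random.default_rng(0).normal(size=(16, 5, 64))   # stand-in for the step above
    merged = context.transpose(1, 0, 2).reshape(5, 16 * 64)       # (5, 1024): the merged heads file
    print(merged.shape)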
From here on it seems a bit confusing; the weight/bias, layer norm, and residual seem to work on each word individually. It seems the 5 1024-long vectors reach the end and each takes a turn stepping onto the 1024 input nodes in the matrix multiplication, I mean, and each node has 50,257 vocab connections, so we end up with 5 vectors of 50,257 each, and possibly we combine these 5 vectors, or not, not sure yet.
The next trained weight is 1024 vectors stacked, each 1024 items long. Each word vector will go along it 1024 times (which way?) to be transformed, then we stamp it with a bias vector. Do this for the other 4 word vectors as well.
Now the residual is that plus what we had from 5.7 Layer1_x_layernorm1.txt. Then we do Layer Norm: multiply by a weight by stamping and add a bias by stamping.
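A rough numpy sketch of these three steps together (random stand-ins for the trained 1024x1024 weight and its bias; layer_norm is the little function sketched earlier):

    import numpy as np

    rng = np.random.default_rng(0)
    merged = rng.normal(size=(5, 1024))     # the merged heads file
    x_before = rng.normal(size=(5, 1024))   # stand-in for what we had before this attention block
    W_out = rng.normal(size=(1024, 1024))   # the next trained weight: 1024 vectors each 1024 long
    b_out = rng.normal(size=(1024,))

    projected = merged @ W_out + b_out      # transform each word vector, then stamp the bias
    residual = x_before + projected         # the residual add
    # then Layer Norm again: multiply by a weight and add a bias by stamping, as sketched earlier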
Transform each word vector from 1024 to 4096 items long by matrix multiplication using 1024 stacked vectors each 4096 long, and add a bias. Then GELU activation 'stamping': positive values aren't affected greatly and negative ones are pushed to nearly zero.
Now we de-transform each word vector back to 1024 using 4096 stacked vectors each 1024 items long, and stamp on a bias using a 1024-item-long bias stamp.
Then we do the residual, adding the 5 stacked vectors (each 1024 items long) from before this block to the output of the GELU step, and then a layer norm using 1024-item-long vectors for the weight and bias.
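A rough numpy sketch of this feed-forward block (the GELU here is the tanh approximation; weights are random stand-ins):

    import numpy as np

    def gelu(x):
        # positives pass through almost unchanged, negatives get squashed toward zero
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(5, 1024))          # output of the previous layer norm
    W_fc = rng.normal(size=(1024, 4096))    # expand 1024 -> 4096
    b_fc = rng.normal(size=(4096,))
    W_proj = rng.normal(size=(4096, 1024))  # shrink back 4096 -> 1024
    b_proj = rng.normal(size=(1024,))

    h = gelu(x @ W_fc + b_fc)               # (5, 4096)
    mlp_out = h @ W_proj + b_proj           # (5, 1024)
    residual = x + mlp_out                  # then the block's final layer norm would follow
    print(residual.shape)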
We have 50,257 word vectors stacked, each 1024 items long, in the vocabulary. Each word is chosen based on the vector it has on show, given its token/position context. All the steps from 5.7 Layer1_x_layernorm1.txt onward are done again, 12 times in GPT, perhaps to use de-ambiguated words to help in translating words etc., until refined enough.
Do a linear matrix multiplication.
We then get logits: 5 stacks, each 50,257 candidates long, as if we translate each of the 5 words up to the last decoder layer.
Then do softmax.
You can use Top-K to output e.g. the top 10 predictions for the next word.
We re-use the vectors already made and append to them for the next words to predict.
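A rough numpy sketch of these last steps (random stand-ins for the final hidden vectors and the 50,257 x 1024 vocabulary matrix):

    import numpy as np

    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(5, 1024))     # final vectors for the 5 tokens after all 12 layers
    wte = rng.normal(size=(50257, 1024))    # the vocabulary embedding matrix, re-used as the output weights

    logits = hidden @ wte.T                 # (5, 50257): one score per vocab candidate per position
    last = logits[-1]                       # only the last position is used to predict the next word
    probs = np.exp(last - last.max())
    probs /= probs.sum()                    # softmax over the 50,257 candidates

    k = 10
    top_k_ids = np.argsort(probs)[-k:][::-1]   # Top-K: e.g. the top 10 next-word candidates
    print(top_k_ids, probs[top_k_ids])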