It is almost done. Please help me make it fully complete.

This is how to code GPT-2. So far it looks easy peasy, but the training
of the network, etc., needs help! It needs more detail.

Once this is done, I can move on to coding it and updating How GPT-2's
systems work and what it all really means in the end.

Also attached is a photo, does GPT-2 do this? How?
[image: 2019-06-11.png]

Use the WebText dataset: gather ~40GB of text from Reddit outbound links with 3 or 
higher karma.

Create the Byte Pair Encoding vocabulary, assign an integer ID to each token, give 
each token a one-hot encoding, and then, based on relational similarity, learn an 
embedding vector for each.
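
Here is a minimal Python/NumPy sketch of that ID -> one-hot -> embedding idea. The vocab, IDs and sizes are toy placeholders (real GPT-2 uses the 50,257-entry BPE vocab and trained embeddings); it just shows that a one-hot times the embedding matrix is the same as a row lookup:

import numpy as np

# Toy vocab; real GPT-2 uses the 50,257-entry Byte Pair Encoding vocab.
vocab = {'<S>': 1, 'the': 2, 'girl': 3, 'is': 4}
d_model = 8                                   # real GPT-2 uses 768/1024/1280/1600

ids = np.array([vocab[t] for t in ['<S>', 'the', 'girl', 'is']])

one_hot = np.eye(len(vocab) + 1)[ids]         # one-hot row per token ID
E = np.random.randn(len(vocab) + 1, d_model)  # embedding matrix (random stand-in)

# Multiplying a one-hot by E just picks out a row, so these two are identical:
embeds_a = one_hot @ E
embeds_b = E[ids]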

Training samples a window randomly over the e.g. 800GB because it can't train on 
all of it, but by spreading out its reading it gets more diversity out of it, so it 
really only reads, say, 200GB. Why not place the training window at every 4th 
offset, then, so it doesn't overtrain on one area? It masks the last token of the 
window and tries to predict it. It uses autograd/Adam backpropagation with 
\beta_1 = 0.9 and \beta_2 = 0.98. They used a batch size of 512. They track 
perplexity, adjust the weights, and measure the loss, which can be visualized on a 
plot. They used a learning-rate schedule where they slowly warmed up the learning 
rate, then decreased it according to the following formula: LRate = 
d_{model}^{-0.5} \cdot \min(step\_num^{-0.5}, step\_num \cdot 
warmup\_steps^{-1.5}). To penalize the model when it becomes too confident in its 
predictions, the authors performed label smoothing.
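
A rough PyTorch sketch of that optimizer and learning-rate schedule. The warmup_steps value, the stand-in model, and the smoothing amount are assumptions (the post doesn't give them), not GPT-2's actual training code:

import torch

d_model = 1024
warmup_steps = 4000          # assumed value; the post doesn't give one

def lrate(step_num):
    # LRate = d_model^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)
    step_num = max(step_num, 1)          # avoid dividing by zero on the very first call
    return d_model ** -0.5 * min(step_num ** -0.5,
                                 step_num * warmup_steps ** -1.5)

model = torch.nn.Linear(d_model, d_model)      # stand-in for the real GPT-2 network
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lrate)
loss_fn = torch.nn.CrossEntropyLoss(label_smoothing=0.1)   # smoothing amount assumed

# Each training step: loss = loss_fn(logits, targets); loss.backward();
# optimizer.step(); scheduler.step(); optimizer.zero_grad()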



Take the ex. 2 input words:
['the', 'girlis']

Get their tokens (break down 'girlis' using Byte Pair Encoding vocab parts; the 
vocab size is 50,257):
[<S>, 'the', 'girl', 'is']

Get their integer IDs and their positional IDs. Types of integer IDs: empty 
padding = 0 for the rest of the 1024-token window if we only have 4 tokens; start 
token = 1 if we start with no context; then vocab IDs like 'the' = 2, 'boat' = 456:
[1, 3, 129, 39, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]
[1, 2, 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...]

Get the '1024 dimensional coordinates/embed' vector (each float is 16 bits long) 
for each input ID and for each positional ID:
[[0.42, 0.67, 0.12, ...], [0.42, 0.67, 0.12, ...], [0.42, 0.67, 0.12, ...], 
[0.42, 0.67, 0.12, ...], ?]
[[0.23, 0.12, 0.76, ...], [0.23, 0.12, 0.76, ...], [0.23, 0.12, 0.76, ...], 
[0.23, 0.12, 0.76, ...], ?]

For each item in the second row, add it to the item directly above it, e.g. 
0.42 + 0.23 = 0.65 for the first item.
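
A NumPy sketch of those steps (padding to the 1024-token window, then summing token and positional embeddings). The ID values are the example ones above; wte/wpe are random placeholders standing in for the trained token and positional embedding tables:

import numpy as np

n_ctx, d_model = 1024, 1024
token_ids = [1, 3, 129, 39]                   # the example IDs above: <S>, 'the', 'girl', 'is'
pos_ids   = [1, 2, 3, 4]

# Pad both out to the full 1024-token window with 0s.
token_ids = np.array(token_ids + [0] * (n_ctx - len(token_ids)))
pos_ids   = np.array(pos_ids   + [0] * (n_ctx - len(pos_ids)))

# Placeholder tables; the trained ones are 50257 x 1024 (tokens) and ~1024 x 1024 (positions).
wte = np.random.randn(50257, d_model)
wpe = np.random.randn(n_ctx + 1, d_model)

# 'Add each item to the item directly above it': token embed + positional embed, per position.
x = wte[token_ids] + wpe[pos_ids]             # shape (1024, 1024)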



Then you take a vector and find its mean: add up all 1024 coordinates and divide 
by 1024. In this example the mean comes out to about -0.003320143824218751.

Then you take each item in the vector and subtract the mean from it, e.g. for a 
coordinate like 0.030731: 0.030731 - (-0.003320143824218751) = 0.0340511438242187.

And then you take the standard deviation, 0.20575964639782154, and add to it 
epsilon (not related to the brain, it only keeps the math from blowing up when the 
deviation is near zero), 0.00001, to get 0.2057696463978215, and we divide the 
previous result 0.0340511438242187 by 0.2057696463978215 to get 
0.1654818600328761.

We then multiply it by the weight 0.0584 dedicated to that coordinate (stamping) 
and add a bias 0.0036237 to get 0.01328784062592 (stamping).
https://www.mathsisfun.com/data/standard-deviation.html
Layer Norm takes each coordinate in a word vector, subtracts the vector's mean so 
the coordinate becomes its deviation from the mean, then divides it by the 
standard deviation, so each coordinate ends up expressing how far it deviates 
(which is slightly unexpected at first) relative to the other coordinates. Then it 
multiplies by a weight and adds a bias. The vector holds the word-embed data and 
the positional data. This is just to normalize the vectors so they are comparable, 
and the scaling also keeps the numbers small to work with.
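
A small NumPy sketch of that arithmetic. The weight 0.0584 and bias 0.0036237 are the example values above, broadcast to every coordinate just for illustration; note real implementations add epsilon to the variance inside the square root rather than to the standard deviation, but the idea is the same:

import numpy as np

def layer_norm(x, weight, bias, eps=0.00001):
    mean = x.mean()                     # average of the 1024 coordinates
    centered = x - mean                 # e.g. 0.030731 - (-0.00332014...) = 0.03405114...
    std = x.std()                       # e.g. 0.20575964639782154
    normed = centered / (std + eps)     # e.g. 0.03405114... / 0.20576964... = 0.16548186...
    return normed * weight + bias       # e.g. 0.16548186 * 0.0584 + 0.0036237 = 0.01328784...

x = np.random.randn(1024) * 0.2         # one word vector (embed + positional data mixed in)
weight = np.full(1024, 0.0584)          # per-coordinate trained weight (same value everywhere, toy)
bias = np.full(1024, 0.0036237)         # per-coordinate trained bias (same value everywhere, toy)
out = layer_norm(x, weight, bias)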



Up to now the embeds in 5.7 Layer1_x_layernorm1.txt are 5 word vectors stacked 
like pancakes, each 1024 coordinates long, so each column is 5 long. 8.4 
layer1_W_QKV.weight.txt (a trained weight matrix) has 1024 weight vectors stacked 
that each have 3072 items in them. We flip the vectors in 5.7 
Layer1_x_layernorm1.txt so the rectangle of vectors is now a tall tower instead. 
But we don't do this for 8.4 layer1_W_QKV.weight.txt. Then, to do the matrix 
multiplication, each flipped word vector's items (let's focus on just one word 
vector here) multiply against their dedicated items in one column of 8.4 
layer1_W_QKV.weight.txt and we sum the results up, then do this again for the 
remaining columns (3071 left) of the weight matrix. The pancake moves to the right 
across the weight matrix, 3072 steps, one step per output item.

Each of the 5 stacked word vectors gets a 3072-item-long vector, and we add to 
each item a bias from 8.5 layer1_W_QKV.bias.txt, like stamping it with ink.

We then split each of these 3072-item vectors into three 1024-item chunks 
(3 x 1024 = 3072) to make the Query, Key, and Value files. Each file has 5 'word' 
vectors in it, stacked, each 1024 items long, so we end up with 3 files of the 
same shape.
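
A NumPy sketch of the QKV projection and split (the weight and bias are random placeholders standing in for 8.4 layer1_W_QKV.weight.txt and 8.5 layer1_W_QKV.bias.txt):

import numpy as np

n_tok, d_model = 5, 1024
x = np.random.randn(n_tok, d_model)            # the 5 stacked word vectors after layer norm
W_qkv = np.random.randn(d_model, 3 * d_model)  # stand-in for 8.4 layer1_W_QKV.weight.txt
b_qkv = np.zeros(3 * d_model)                  # stand-in for 8.5 layer1_W_QKV.bias.txt

qkv = x @ W_qkv + b_qkv                        # matrix multiply, then 'stamp' on the bias
Q, K, V = np.split(qkv, 3, axis=-1)            # three (5, 1024) blocks: Query, Key, Value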

We then split each file into per-head files, so that a file, e.g. the Query file, 
now has 16 sections, one for each head, and each of the 16 sections has 5 vectors 
that each have 64 features. Each of the 5 words has its own query vector. The same 
split is done 'redundantly' for the other 15 heads and for the K and V files. The 
K file is twisted a bit: it has 16 head sections, and each section is 64 rows 
stacked, each 5 long, one entry per word. The V file isn't twisted.
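
A sketch of that head split with random stand-ins for the Q/K/V files; the reshape and transpose are the part that matters:

import numpy as np

n_tok, n_head, d_head = 5, 16, 64              # 16 heads x 64 features = 1024
Q = np.random.randn(n_tok, n_head * d_head)    # stand-ins for the Q/K/V files above
K = np.random.randn(n_tok, n_head * d_head)
V = np.random.randn(n_tok, n_head * d_head)

def split_heads(m):
    # (5, 1024) -> (16, 5, 64): one section per head, 5 word vectors of 64 features each
    return m.reshape(n_tok, n_head, d_head).transpose(1, 0, 2)

Qh = split_heads(Q)                            # (16, 5, 64)
Kh = split_heads(K).transpose(0, 2, 1)         # the 'twisted' K file: (16, 64, 5)
Vh = split_heads(V)                            # (16, 5, 64), not twisted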

12.4 Layer1_x_attention_scores.txt has 16 head sections, each with 5 vectors 
stacked, each 5 features long. We get this by multiplying each of the 5 stacked 
vectors (64 features long) in the Q file against each of the 5 64-long vectors in 
the K file and adding the products up. So for a given word's query we get 5 
results in a new vector, 5 such vectors stacked, 16 head sections.

We then scale each one: divide each item by the square root of 64 (i.e. divide it 
by 8). After the scaled Q*K we add a bias stamp.

We then mask each head, so for a given head the 5 rows, each 5 scores long, look 
like:
[[-1.4111e+00, -1.0000e+04, -1.0000e+04, -1.0000e+04, -1.0000e+04],
[-9.3259e-01,  4.7022e-01, -1.0000e+04, -1.0000e+04, -1.0000e+04],
[-1.2644e+00, -5.7796e-01, -8.5491e-01, -1.0000e+04, -1.0000e+04],
[-1.4166e+00, -8.0913e-01, -4.4999e-01, -6.0653e-03, -1.0000e+04],
[-1.5199e+00, -7.7955e-01, -6.0395e-01, -1.1510e-01, -1.2940e+00]]

Then softmax each row so it adds up to 1 and the entries are positive numbers:
[[1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00],          
[1.9737e-01, 8.0263e-01, 0.0000e+00, 0.0000e+00, 0.0000e+00],
[2.2258e-01, 4.4220e-01, 3.3523e-01, 0.0000e+00, 0.0000e+00],
[1.0457e-01, 1.9197e-01, 2.7492e-01, 4.2854e-01, 0.0000e+00],
[9.1546e-02, 1.9193e-01, 2.2878e-01, 3.7300e-01, 1.1474e-01]]
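
A NumPy sketch of the score, scale, mask, and softmax steps for all 16 heads at once (Qh/Kh are random stand-ins for the split-head files):

import numpy as np

n_tok, n_head, d_head = 5, 16, 64
Qh = np.random.randn(n_head, n_tok, d_head)    # stand-in split-head Query file
Kh = np.random.randn(n_head, d_head, n_tok)    # stand-in 'twisted' Key file

scores = Qh @ Kh                               # (16, 5, 5), like 12.4 Layer1_x_attention_scores.txt
scores = scores / np.sqrt(d_head)              # scale: divide by sqrt(64) = 8

# Causal mask: put -10000 on every position after the current word, as in the matrix above.
future = np.triu(np.ones((n_tok, n_tok)), k=1).astype(bool)
scores = np.where(future, -1e4, scores)

# Softmax each row so it is all positive and sums to 1.
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)    # (16, 5, 5)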

Now take the split-heads Value file: each head is 5 pancakes stacked, each 64 
features long, and each pancake has a matching row in the softmaxed file. For each 
of the 5 items in e.g. the first softmaxed row we multiply it by its dedicated 
Value vector FOR that word (that item times features 1, 2, 3...64), so the first 
row above now has 64 items. We can see here that the Value file gives some 
identical results for one-word self-attention. Values appear to shrink frequent 
English words. The attention scoring is done all-to-all per word (64 against 64) 
and then masked so a word is only looking at previous words; hence for e.g. word_3 
this gives three scores, which multiply against their Values, and then we combine 
the weighted Values to get the actual output for row 3 (word 3) over the previous 
words.
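
A sketch of that weighted sum over the Value vectors (random stand-ins again; 'weights' plays the role of the softmaxed rows above):

import numpy as np

n_tok, n_head, d_head = 5, 16, 64
weights = np.random.dirichlet(np.ones(n_tok), size=(n_head, n_tok))   # stand-in softmaxed rows
Vh = np.random.randn(n_head, n_tok, d_head)                           # stand-in split-head Value file

# Each word's 5 softmax weights scale the 5 Value vectors it may look at,
# and the weighted Value vectors are summed into one 64-item vector per word.
head_out = weights @ Vh                                               # (16, 5, 64)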

Each head's first row is then combined across the 16 heads to get 1024 items again 
for the first word, and likewise for the other 4 words; this is the merged-heads 
file. Some diagrams show the steps from head splitting up to merging inside a 
block marked "12x" for GPT-2; the split-and-merge here happens for every head 
inside a single layer, while the "12x" refers to stacking the whole layer 12 
times, and nothing here is adding up to the next word yet.
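
And the merge back to 1024 items per word, as a short sketch:

import numpy as np

n_tok, n_head, d_head = 5, 16, 64
head_out = np.random.randn(n_head, n_tok, d_head)     # per-head outputs from the previous step

# (16, 5, 64) -> (5, 16, 64) -> (5, 1024): glue the 16 heads back together per word.
merged = head_out.transpose(1, 0, 2).reshape(n_tok, n_head * d_head)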

From here on it was a bit confusing, but the weight/bias, layer norm, and residual 
all work on each word vector individually. Each of the 5 1024-long vectors reaches 
the end and takes its turn through the final matrix multiplication, whose 1024 
input nodes each have 50,257 vocab connections, so we end up with 5 vectors of 
50,257 scores. These 5 vectors are not combined; each position keeps its own 
next-word distribution, and only the last word's distribution is used to pick the 
next word.

The next trained weight is 1024 vectors stacked, each 1024 items long. Each word 
vector is multiplied against it (dotted with each of its 1024 columns, so 1024 dot 
products) to transform it, then stamped with a bias vector. Do this for the other 
4 word vectors as well.

Now the residual is that result plus what we had from 5.7 Layer1_x_layernorm1.txt. 
Then we do Layer Norm: multiply by the weight by stamping and add the bias by 
stamping.
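
A sketch of the output projection, residual, and layer norm (all weights are random placeholders; x_in stands for the earlier activations that the residual adds back in):

import numpy as np

n_tok, d_model = 5, 1024
merged = np.random.randn(n_tok, d_model)          # merged-heads output
x_in = np.random.randn(n_tok, d_model)            # the earlier activations the residual adds back

W_proj = np.random.randn(d_model, d_model)        # the 1024 x 1024 trained weight (random here)
b_proj = np.zeros(d_model)

attn_out = merged @ W_proj + b_proj               # transform each word vector, stamp the bias
x = x_in + attn_out                               # residual connection

# Layer norm again, with its own 1024-long weight and bias 'stamps'.
g, b = np.ones(d_model), np.zeros(d_model)
x = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5) * g + b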

Transform each word vector from 1024 to 4096 items long by matrix multiplication 
using 1024 stacked vectors each 4096 long, and add a bias. Then apply the GELU 
activation 'stamping': positive values aren't affected greatly and negative ones 
are squashed to nearly zero.

Now we transform each word vector back down to 1024 using 4096 stacked vectors, 
each 1024 items long, and stamp a bias using a 1024-item-long bias stamp.

Then we do the residual, adding these 5 stacked 1024-item vectors to what we had 
going into the GELU step, and then layer norm using 1024-item-long vectors for the 
weight and bias.
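
The whole MLP part as one sketch (random placeholder weights; the GELU here is the tanh approximation GPT-2's code uses):

import numpy as np

n_tok, d_model, d_ff = 5, 1024, 4096
x = np.random.randn(n_tok, d_model)               # output of the previous layer norm

W_fc  = np.random.randn(d_model, d_ff)            # 1024 stacked vectors, each 4096 long
b_fc  = np.zeros(d_ff)
W_out = np.random.randn(d_ff, d_model)            # 4096 stacked vectors, each 1024 long
b_out = np.zeros(d_model)

def gelu(v):
    # Positives pass through almost unchanged, negatives get squashed toward zero.
    return 0.5 * v * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (v + 0.044715 * v ** 3)))

h = gelu(x @ W_fc + b_fc)                         # 1024 -> 4096, bias, GELU 'stamp'
mlp_out = h @ W_out + b_out                       # back down to 1024, bias stamp
x = x + mlp_out                                   # residual with the MLP's input

g, b = np.ones(d_model), np.zeros(d_model)        # layer-norm weight/bias
x = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5) * g + b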

We have 50,257 word vectors stacked, each 1024 items long, in the vocabulary. Each 
word is chosen based on the vector it presents, given the token/position context. 
All the steps from 5.7 Layer1_x_layernorm1.txt onward are done again 12 times in 
GPT, perhaps to use the now-disambiguated words to help in translating/predicting 
words etc., until refined enough.

Do a linear matrix multiplication.

We then get logits, 5 stacks each 50,257 candidates long, as if we were 
translating each of the 5 words up to the last decoder layer.

Then do a softmax.

We can use Top-K to output e.g. the top 10 predictions for the next word.

We re-use vectors already made and append to them for the next words to predict.
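
A final sketch putting the last steps together: logits from the vocab matrix, softmax, Top-K, and appending the chosen token for the next round (random placeholders for the hidden states and wte):

import numpy as np

n_tok, d_model, vocab_size = 5, 1024, 50257
x = np.random.randn(n_tok, d_model)               # final hidden states after the 12 layers
wte = np.random.randn(vocab_size, d_model)        # the 50257 x 1024 vocab embedding matrix

logits = x @ wte.T                                # (5, 50257): one score per vocab candidate
last = logits[-1]                                 # only the last word's row predicts the next token

probs = np.exp(last - last.max())
probs = probs / probs.sum()                       # softmax over the 50,257 candidates

k = 10
top_ids = np.argsort(probs)[-k:][::-1]            # Top-K: the 10 most likely next tokens
next_id = np.random.choice(top_ids, p=probs[top_ids] / probs[top_ids].sum())
# Append next_id to the token IDs and run again to predict the word after that.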
