BPE will also save me 3x-5x the RAM, since I no longer have to store a trie entry at every single-letter offset. The only downside is the large set of predictions.
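To see why the savings show up, here is a toy sketch (my own code, not any real implementation): a character-level trie that inserts a window at every 1-letter offset versus a token-level trie that only inserts at token boundaries. The text, window sizes, and plain-dict trie are all made up for illustration.

```python
# Toy comparison: char-offset trie vs token-offset trie node counts.

def insert(trie, seq):
    """Insert a sequence of symbols (chars or tokens) into a dict-based trie."""
    node = trie
    for sym in seq:
        node = node.setdefault(sym, {})

def count_nodes(trie):
    """Total number of trie nodes below the root."""
    return sum(1 + count_nodes(child) for child in trie.values())

text = "we went walking we went walking"
tokens = text.split()   # stand-in for BPE tokens
WIN = 8                 # context window length, in symbols

char_trie, tok_trie = {}, {}
for i in range(len(text)):
    insert(char_trie, text[i:i + WIN])     # one insertion per letter offset
for i in range(len(tokens)):
    insert(tok_trie, tokens[i:i + 4])      # one insertion per token offset

print(count_nodes(char_trie), count_nodes(tok_trie))
```

On any realistic text the char-offset trie ends up with far more nodes, which is the RAM savings being claimed.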
Ok, so what is my plan? I have to write it down. I'll build a biggish vocab, up to 50,257 items at most like GPT-2, using BPE. But online, meaning I'll need a way to decide how many vocab items I should BPE for 1-100MB if I use enwik8.txt... some sort of entropy measure, maybe, to determine the BPE depth.

I'll store offsetted windows into the memory tree, as I highlighted. I'll know where to slide the window by sending the next letters through the tree to find the next recognizable vocab item. Ex. my vocab is we, went, walk, i, n, g, and I see '[we went walk]ing', so I slide the window one chunk over onto 'we [went walki]ng', because I can only match up to 'i' — there is no 'in' or 'ing' in the vocab. So in this thinking, instead of pre-processing enwik8.txt into letters that act like chunks of words, I slide my window over chunks as explained (assuming I don't want to tokenize it ahead of time), and I predict such chunks too. But I'll need to store in the tree where each BPE part is, and I'll need to predict a big set of predictions.

Instead of storing 'letterletterletter', [counter, counter, counter, pointer, pointer, pointer], I could store word chunks: 'we went walk i n g', [counter, counter, counter, pointer, pointer, pointer], where the first item is really the space character: ' we...', [...].

Going to layer 1 in this tree will be costly if I don't ID the features so that something like ord() finds them instantly. That means turning every word chunk into a code number, ex. the=44, up to the 50,257 vocab size. If I'm looking for 'the', I don't know where it is saved, and ord() can't take me to its branch. But if I store the mapping the<>#44 in the node and look up IDs instead of text, seeing #44 takes me straight to node #44 where 'the' is — assuming I rebuild the tree each time I add new vocab items, since new items must not be allowed to choose where they are saved. I'd have to populate layer 1 with up to 60K slots (2-byte combinations).
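The window-sliding rule above can be sketched as a greedy longest-match against the vocab, plus the chunk-to-ID mapping (the 'the=44' idea). This is a minimal toy sketch: the vocab and the IDs like we=1 are made up for illustration, and unknown letters fall back to -1.

```python
# Toy vocab with made-up integer IDs (the real one would be up to 50,257 items).
vocab = {"we": 1, "went": 2, "walk": 3, "i": 4, "n": 5, "g": 6, " ": 7}

def next_chunk(text, pos, vocab):
    """Longest vocab item matching at pos; falls back to a single letter."""
    best = text[pos]
    for end in range(pos + 1, len(text) + 1):
        if text[pos:end] in vocab:
            best = text[pos:end]
    return best

def chunk_ids(text, vocab):
    """Slide the window one matched chunk at a time, emitting chunk IDs."""
    pos, ids = 0, []
    while pos < len(text):
        chunk = next_chunk(text, pos, vocab)
        ids.append(vocab.get(chunk, -1))   # -1 = letter not in vocab
        pos += len(chunk)
    return ids

print(chunk_ids("we went walking", vocab))
# 'walk' is matched whole, then 'ing' decomposes to i, n, g singles,
# exactly as in the '[we went walk]ing' example above.
```

Looking up a chunk by its integer ID is then an array index instead of a text search, which is the point of storing #44 rather than 'the'.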
As for layer 2: same idea, and in principle each layer-1 node could have 50,257 branches, but obviously it won't have that many. If it did, that would mean we had seen ~2.5 billion unique 2-word phrases, and at ~5 bytes per word that is 2,500,000,000 * 5 bytes = 12,500,000,000 bytes, while enwik9.txt is only 1,000,000,000 bytes. A 1GB corpus at ~5 bytes per word holds ~200 million words, so the max is ~4K branches per layer-1 node (200M / 50,257), and even that would never be the case — average is maybe 2K or 1K or less. From an intuitive and observed view, most words appear in enwik9.txt maybe 500-6,000 times, so you should only see ~200-2,000 branches at layer 2, meaning I won't need to pre-store IDs here, and it would be costly if I did.

This is a lot to think about... my head is spinning...

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T90b7756a48658254-M5eecd711b6cc43eee846006c
Delivery options: https://agi.topicbox.com/groups/agi/subscription
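P.S. A quick back-of-envelope check of the layer-2 numbers above (my arithmetic, assuming ~5 bytes per average word and a 1GB enwik9.txt):

```python
# Sanity check on the layer-2 branching estimate.
vocab_size = 50_257
unique_bigrams = vocab_size ** 2           # possible unique 2-word phrases
bytes_to_spell = unique_bigrams * 5        # ~5 bytes per word, ~12.6 GB total
enwik9_bytes = 1_000_000_000
words_in_corpus = enwik9_bytes // 5        # ~200 million words
max_branches = words_in_corpus // vocab_size  # avg bigrams per layer-1 node

print(unique_bigrams, bytes_to_spell, max_branches)
```

The last number comes out just under 4,000, matching the ~4K-branches-per-node ceiling above.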
