BPE will also save me 3x-5x the RAM, since I no longer have to store a trie entry at every single-letter offset. The only downside is the large set of predictions.
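To see why the savings show up, here is a toy sketch (my own code, not any real implementation): a character-level trie that inserts a window at every 1-letter offset versus a token-level trie that only inserts at token boundaries. The text, window sizes, and plain-dict trie are all made up for illustration.

```python
# Toy comparison: char-offset trie vs token-offset trie node counts.

def insert(trie, seq):
    """Insert a sequence of symbols (chars or tokens) into a dict-based trie."""
    node = trie
    for sym in seq:
        node = node.setdefault(sym, {})

def count_nodes(trie):
    """Total number of trie nodes below the root."""
    return sum(1 + count_nodes(child) for child in trie.values())

text = "we went walking we went walking"
tokens = text.split()   # stand-in for BPE tokens
WIN = 8                 # context window length, in symbols

char_trie, tok_trie = {}, {}
for i in range(len(text)):
    insert(char_trie, text[i:i + WIN])     # one insertion per letter offset
for i in range(len(tokens)):
    insert(tok_trie, tokens[i:i + 4])      # one insertion per token offset

print(count_nodes(char_trie), count_nodes(tok_trie))
```

On any realistic text the char-offset trie ends up with far more nodes, which is the RAM savings being claimed.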
Ok, so what is my plan? I have to write it down. I'll build a biggish vocab, up to 50,257 items at most like GPT-2, using BPE. But online, meaning I'll need a way to decide how many vocab items I should BPE for 1-100MB if I use enwik8.txt... some sort of entropy measure, maybe, to determine the BPE depth.

I'll store offsetted windows into the memory tree, as I highlighted. I'll know where to slide the window by sending the next letters through the tree to find the next recognizable vocab item. Ex. my vocab is we, went, walk, i, n, g, and I see '[we went walk]ing', so I slide the window one chunk over onto 'we [went walki]ng', because I can only match up to 'i' — there is no 'in' or 'ing' in the vocab. So in this thinking, instead of pre-processing enwik8.txt into letters that act like chunks of words, I slide my window over chunks as explained (assuming I don't want to tokenize it ahead of time), and I predict such chunks too. But I'll need to store in the tree where each BPE part is, and I'll need to predict a big set of predictions.

Instead of storing 'letterletterletter', [counter, counter, counter, pointer, pointer, pointer], I could store word chunks: 'we went walk i n g', [counter, counter, counter, pointer, pointer, pointer], where the first item is really the space character: ' we...', [...].

Going to layer 1 in this tree will be costly if I don't ID the features so that something like ord() finds them instantly. That means turning every word chunk into a code number, ex. the=44, up to the 50,257 vocab size. If I'm looking for 'the', I don't know where it is saved, and ord() can't take me to its branch. But if I store the mapping the<>#44 in the node and look up IDs instead of text, seeing #44 takes me straight to node #44 where 'the' is — assuming I rebuild the tree each time I add new vocab items, since new items must not be allowed to choose where they are saved. I'd have to populate layer 1 with up to 60K slots (2-byte combinations).
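The window-sliding rule above can be sketched as a greedy longest-match against the vocab, plus the chunk-to-ID mapping (the 'the=44' idea). This is a minimal toy sketch: the vocab and the IDs like we=1 are made up for illustration, and unknown letters fall back to -1.

```python
# Toy vocab with made-up integer IDs (the real one would be up to 50,257 items).
vocab = {"we": 1, "went": 2, "walk": 3, "i": 4, "n": 5, "g": 6, " ": 7}

def next_chunk(text, pos, vocab):
    """Longest vocab item matching at pos; falls back to a single letter."""
    best = text[pos]
    for end in range(pos + 1, len(text) + 1):
        if text[pos:end] in vocab:
            best = text[pos:end]
    return best

def chunk_ids(text, vocab):
    """Slide the window one matched chunk at a time, emitting chunk IDs."""
    pos, ids = 0, []
    while pos < len(text):
        chunk = next_chunk(text, pos, vocab)
        ids.append(vocab.get(chunk, -1))   # -1 = letter not in vocab
        pos += len(chunk)
    return ids

print(chunk_ids("we went walking", vocab))
# 'walk' is matched whole, then 'ing' decomposes to i, n, g singles,
# exactly as in the '[we went walk]ing' example above.
```

Looking up a chunk by its integer ID is then an array index instead of a text search, which is the point of storing #44 rather than 'the'.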
As for layer 2: same idea, and in principle each layer-1 node could have 50,257 branches, but obviously it won't have that many. If it did, that would mean we had seen ~2.5 billion unique 2-word phrases, and at ~5 bytes per word that is 2,500,000,000 * 5 bytes = 12,500,000,000 bytes, while enwik9.txt is only 1,000,000,000 bytes. A 1GB corpus at ~5 bytes per word holds ~200 million words, so the max is ~4K branches per layer-1 node (200M / 50,257), and even that would never be the case — average is maybe 2K or 1K or less. From an intuitive and observed view, most words appear in enwik9.txt maybe 500-6,000 times, so you should only see ~200-2,000 branches at layer 2, meaning I won't need to pre-store IDs here, and it would be costly if I did.

This is a lot to think about... my head is spinning...

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T90b7756a48658254-M5eecd711b6cc43eee846006c
Delivery options: https://agi.topicbox.com/groups/agi/subscription
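P.S. A quick back-of-envelope check of the layer-2 numbers above (my arithmetic, assuming ~5 bytes per average word and a 1GB enwik9.txt):

```python
# Sanity check on the layer-2 branching estimate.
vocab_size = 50_257
unique_bigrams = vocab_size ** 2           # possible unique 2-word phrases
bytes_to_spell = unique_bigrams * 5        # ~5 bytes per word, ~12.6 GB total
enwik9_bytes = 1_000_000_000
words_in_corpus = enwik9_bytes // 5        # ~200 million words
max_branches = words_in_corpus // vocab_size  # avg bigrams per layer-1 node

print(unique_bigrams, bytes_to_spell, max_branches)
```

The last number comes out just under 4,000, matching the ~4K-branches-per-node ceiling above.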
