New pictures are provided too.

I have been working on AGI full-time for five years now (unpaid), mostly on 
the text modality, and it has turned out to be very insightful. I have a 
very large design with many of the "bells and whistles", while the architecture 
itself that runs it all is very simple and has made some of my friends 
cringe. Below I lay out a good portion of my work. I really hope you can help 
my direction, or that I can help yours.

Nearly every part of my design/ architecture can be found in the AI field. 
Hierarchies, yup. Weights, yup. Rewards, yup. Reward Update, yup. Activation 
Function, yup. Word2Vec, yup. Seq2Seq, yup. Energy, yup. Pruning, yup. Online 
Learning, yup. Pooling, yup. Mixing Predictions, yup. Etc. It's when I unify 
them together that I start getting a new view that no one else shares. I'm able 
to look into my net and understand everything and how it all works together; 
that's why I was able to learn so much about the architecture.

I've coded my own Letter Predictor; it compresses 100MB to 21.8MB, and the 
world record is 14.8MB. Mine updates frequencies online and mixes predictions, 
and more. I still have tons to add to it, so I will likely come close to the 
world record. How it works is in the supplementary attached file.

So below I'm going to present a lot of my design, showing how I unify these 
things together in a single net. And you can tell me if there's a more natural 
way or not. I've tested other people's algorithms like GPT-2 and they can 
accomplish what I present, but the natural way to do it is never shown in an 
image or explained like I explain it; they just stack black boxes on each other.

See this image to get a basic view of my architecture. It's a toy example. 
https://ibb.co/p22LNrN It's a hierarchy of features that get re-used/shared to 
build larger memories. The brain stores a word or phrase only once and links 
all sentences it has ever heard to it. That makes for an extremely powerful 
breeding ground. Note the brain doesn't store a complete pyramid like I show in 
my image, just bits and parts; a collection of small hierarchies. So think of 
my image as a razor-tooth saw, not a single very tall pyramid triangle. 
https://ibb.co/d4JVm55

Notice how all the nodes are "perfectly" clean? Well, nodes can be merged 
("whalkinge"), have variable weights ("wALkinG"), and be pruned ("my are cute") 
to get a "compressed", fuzzy-like network, but for now we can keep a clean 
hierarchy so we can easily see what is going on!

I have a working algorithm (trie/tree-based) that updates the connection 
weights in the tree when it accesses a feature (order in time matters, e.g. 
a>b>c; c>b>a is a different feature), so it knows how many times it has seen 
'z' or 'hello' or 'hi there' in its life so far! Frequencies! This is my Online 
Training for weights. Adding more data always improves my predictor/model, 
guaranteed. I evaluated my model's predictions using not Perplexity but 
Lossless Compression. So now you can imagine my razor-tooth hierarchies with 
counts (weights) placed on the connections. Good so far. It's starting to look 
like a real network and can function like one too! https://ibb.co/hC8gkFC
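A minimal sketch of the frequency idea (illustrative only, not my actual 
predictor code; names like `FeatureTrie` are made up): a trie where every 
ordered sequence gets a node whose count is incremented online as data arrives.

```python
from collections import defaultdict

class TrieNode:
    def __init__(self):
        self.count = 0                       # times this exact sequence was seen
        self.children = defaultdict(TrieNode)

class FeatureTrie:
    def __init__(self, max_order=8):
        self.root = TrieNode()
        self.max_order = max_order           # longest feature (context) stored

    def observe(self, text):
        # Online training: one pass, counts update as each symbol streams in.
        for i in range(len(text)):
            node = self.root
            for ch in text[i:i + self.max_order]:
                node = node.children[ch]
                node.count += 1              # just increment, no backprop

    def frequency(self, feature):
        node = self.root
        for ch in feature:
            if ch not in node.children:
                return 0
            node = node.children[ch]
        return node.count

trie = FeatureTrie()
trie.observe("hello hello hi there")
print(trie.frequency("hello"))   # 2
print(trie.frequency("olleh"))   # 0 -- reversed order is a different feature
```

Adding more data can only add or raise counts, which is why the model improves 
monotonically as it reads.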

Now for the cool part I want to add here. I hope you know Word2Vec or Seq2Seq. 
They discover cat=dog based on shared contexts. The key question we will focus 
on here is: how does the brain find out cat=dog using the same network 
hierarchy? Here's my answer below, and I want to know if you knew this or if 
you have a more natural way.

https://ibb.co/F4BL1Ys Notice I highlighted the cats and dogs nodes? The brain 
may see "my cats eat food" 5 times and then, tomorrow, may see "my dogs eat 
food" 6 times. Only through their shared contexts will energy leak and trigger 
cats from dogs. There's no other realistic way this could occur. The brain 
finds out that cats are similar to dogs on its own, by shared strengthened 
paths leaking energy. So the next time it sees "dogs" in an unseen sentence 
like "dogs play", it will activate both the dogs and cats nodes by some amount.

We ignore common words like "the" or "I" because they appear around most 
words; co-occurring with "the" doesn't mean cats=boat. High-frequency nodes 
are ignored.
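A toy sketch of how shared contexts alone surface cats=dogs (illustrative code, 
not my implementation; the hand-made stop-word set stands in for "ignore 
high-frequency nodes"):

```python
from collections import Counter, defaultdict

STOP = {"the", "i", "a"}   # stand-in for ignoring high-frequency nodes

def context_counts(sentences, window=2):
    # For each word, count which non-stop words appear within the window.
    contexts = defaultdict(Counter)
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i and s[j] not in STOP:
                    contexts[w][s[j]] += 1   # strengthened shared path
    return contexts

def shared_context_score(contexts, a, b):
    shared = contexts[a] & contexts[b]       # Counter '&' takes min of counts
    return sum(shared.values())

corpus = [["my", "cats", "eat", "food"]] * 5 + [["my", "dogs", "eat", "food"]] * 6
ctx = context_counts(corpus)
# cats and dogs never co-occur, yet their shared contexts link them:
print(shared_context_score(ctx, "cats", "dogs"))   # 15
```

The score is bounded by the smaller count on each shared path, which matches 
the intuition that energy only leaks as much as the weaker side allows.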

Word2Vec and similar methods can look at both sides around a word to translate, 
use long windows and skip-gram windows, weight closer words in time more 
heavily, and especially weight by the number of times seen (frequencies). My 
hierarchy can naturally do all of that. Word2Vec also uses Negative Sampling, 
and my design can likewise use inhibition for both next-word and translation 
prediction.

Word2Vec stores words as vectors in many dimensions and then compares which are 
closer in the space, whereas my design just triggers related nodes according to 
how many local contexts are shared. No vectors are stored in the brain, nor do 
we need Backprop to update connections. We increment counts, prune 
low-frequency nodes, merge them, etc.; we don't need Backprop to "find" this 
out, we just need to know how and why we update weights!

There's such a thing as contextual word vectors. Say we see "a beaver was near 
the bank"; here we disambiguate "bank". In my design, it triggers river or wood 
more than TD Trust or a financial building. Although "near the bank and the 
building" and "near the bank with wood" both share bank, the beaver in my input 
sentence triggers the latter sentence more than the financial one.
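A tiny illustrative sketch of that disambiguation step (the stored "sentences" 
and the overlap scoring are made up for the example, not my exact mechanism):

```python
# Each sense of "bank" is represented by a stored context; the input
# leaks energy into whichever stored context it overlaps most.
stored = {
    "river":     ["near", "the", "bank", "with", "wood", "beaver"],
    "financial": ["near", "the", "bank", "and", "the", "building", "money"],
}
inp = ["a", "beaver", "was", "near", "the", "bank"]

# Score each sense by how many input words it shares.
scores = {sense: len(set(inp) & set(words)) for sense, words in stored.items()}
print(max(scores, key=scores.get))   # river -- "beaver" tips the balance
```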

Word2Vec can do "king is to queen as man is to what?" by using the dimensions 
king has that man doesn't, finding where queen sits dimensionally, and removing 
those king dimensions to land at woman. Or USA is to Canada as China is to 
India: instead of one lacking a context, they both share it here but the 
location is slightly off in value. But the brain doesn't do this naturally; 
just try "cakes are to toast as chicken is to what?" Naturally the brain picks 
a word with all 3 properties.

To do the king/woman thing we need to see that the only difference is man 
isn't royal, so we want the word most related to queen but not royal, hence 
woman. This involves a NOT operation, somehow.

Ok so, when my architecture is presented with "walking down the", it activates 
multiple nodes like "alking down the", "lking down the", "king down the", and 
so on down to "down the" and "the", and also skip-gram nodes, e.g. "walking 
the", as well as related nodes, e.g. "running up that" and "walking along the". 
My code, BTW, does this, but not the related or skip-gram nodes yet! What 
occurs now is that all activated nodes share parent predictions on the 
right-hand side to predict the next letter or word. So "down the" and "the" and 
"up this" all leak energy forward to "street". This Mixing (see the Hutter 
Prize or PPM) improves prediction. You can only repeat the alphabet forward 
because it was stored that way. Our nodes have now mixed their predictions to 
decide on a better set of predictions. https://ibb.co/Zz91jQQ
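A toy sketch of the mixing step, PPM-style (the model table and the 
order-squared weighting are assumptions for illustration, not my exact scheme): 
every matched suffix context contributes its next-symbol counts, with longer 
contexts weighted more.

```python
from collections import Counter

def mix_predictions(history, model, max_order=4):
    # model: dict mapping a context string -> Counter of observed next words
    mixed = Counter()
    for order in range(1, max_order + 1):
        ctx = history[-order:]               # suffix of the input, length `order`
        nexts = model.get(ctx)
        if not nexts:
            continue                         # this context was never stored
        total = sum(nexts.values())
        for sym, n in nexts.items():
            # longer (more specific) contexts leak more energy forward
            mixed[sym] += (order * order) * n / total
    return mixed

model = {
    "the":        Counter({"street": 7, "dog": 3}),
    "down the":   Counter({"street": 4, "road": 1}),
    "g down the": Counter({"street": 2}),
}
# "street" dominates because every matched context predicts it:
print(mix_predictions("walking down the", model, max_order=10).most_common(1))
```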

My design therefore recognizes nodes despite typos or related words. It can 
also handle rearranged words like "down walking the" by the time delay from 
children nodes. Our "matches" in the hierarchy are many, and we now have many 
forward predictions, so we can take the top 10 predicted words. We usually pick 
the top prediction; mutation makes it imperfect on purpose, which is important.

You may wonder: why does thinking in the brain only hear 1 of the top 10 
predictions? All 10 nodes are activated, and so are recently heard nodes kept 
Active! If they were all heard, you'd hear them all in your mind, surely? If 
you imagine video in your brain, it'd be very odd to predict the next frame as 
a dog, cat, horse, and sheep all blended together like a monster. The brain 
needs precision. So Pooling, as done in CNNs, is used in picking from the top 
10 predictions! Other nodes and predictions are still activated, just not as 
much.
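Pooling over the top predictions can be sketched like this (the `mutation` rate 
is an illustrative stand-in for the deliberate imperfection mentioned above, 
not a tuned value):

```python
import random

def pool(predictions, k=10, mutation=0.1, rng=random):
    # predictions: dict of candidate -> score; pick one winner, not a blend.
    top = sorted(predictions, key=predictions.get, reverse=True)[:k]
    if rng.random() < mutation:
        return rng.choice(top)   # rare deliberate mutation among the top k
    return top[0]                # usually the single strongest node wins

preds = {"street": 0.6, "road": 0.25, "dog": 0.1, "cat": 0.05}
print(pool(preds, mutation=0.0))   # street
```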

Also, Pooling in my architecture can be done on every node's outputs, not just 
the final high layer! Pooling helps recognition by focusing. Pooling can be 
controlled indirectly to make the network Summarize, Elaborate, or keep Stable. 
It simply says or doesn't say more or less important nodes, based on the 
probability of being said. You may ignore all the "the"s, or you may say a lot 
of filler content that isn't even rewarding, like talking about food (see 
below).

When given a prompt, e.g. "What do you want to eat? What?", you may first 
parrot exactly the start, and some of it may be said in your own loved words: 
I, fries, etc. Or you may just say the continuation. You might just say what 
they said and stop the energy flowing forward. And you might say fries in place 
of "What?". Why!? Because their words, and your loved words fries, I, etc., are 
pre-active.

One more thing I'll go through is Temporary Energy and Permanent Energy in my 
architecture. You can see Facebook's new chatbot Blender is like GPT-2, but it 
has a Dialog Persona that makes it always say certain words/nodes. So if it 
likes food or communism, it will bring it up somehow in everything. Just look 
at what I'm writing; it's all AI related! Check out the latter half of this 
guy's video: https://www.youtube.com/watch?v=wTIPGoHLw_8

In my design, positive and inhibitory reward is installed on just a few nodes 
at birth, and it can transfer reward to related nodes to update its goals. It 
may see contextually that food=money, so now it starts talking about money. 
Artificial rewards are changeable; the root goal is not as modifiable.

For Temporarily Active nodes: you can remember that a password is "car" and 
later forget it, while of course retaining the car node. This is a different 
kind of forgetting than pruning weak weights forever. GPT-2 is probably using 
the last 1,000 words for prediction by this very mechanism. The brain already 
has to keep the last 10 or so words in memory, so any predicted nodes that are 
pre-active from being held in memory get a boost. If you read "the cat and cat 
saw cats cat then a cute", you predict cat, and the cat node has already been 
activated 4 times just recently. You're holding the words in your hierarchy 
nodes, not on paper anymore. So yes, energy is retained for a while and affects 
the predicted Probabilities!
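A minimal sketch of the recency boost (the boost size and base probabilities 
are arbitrary illustrations, not my tuned values):

```python
def boosted(base_probs, recent_words, boost=0.05):
    # base_probs: word -> probability from the static model.
    # Each recent activation of a word adds temporary energy to it.
    probs = dict(base_probs)
    for w in recent_words:
        if w in probs:
            probs[w] += boost
    total = sum(probs.values())
    return {w: p / total for w, p in probs.items()}   # renormalize

base = {"cat": 0.2, "dog": 0.2, "tree": 0.6}
recent = "the cat and cat saw cats cat then a cute".split()
out = boosted(base, recent)   # "cat" was heard 3 times, so it gains energy
print(round(out["cat"], 3))   # 0.304
```

Note that "cat" started tied with "dog" but overtakes it purely from the 
retained activation.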

I once played Pikmin for half the day, and when I went into the kitchen things 
looked like Pikmin faces, or I saw them faintly but still somewhat vividly 
running around things. The same mechanism causes dreams: more random 
predictions drawn from the top 10 or 100. The predictions in dreams aren't 
really good.

You can see how this helps. Say you've only read 100,000 bytes of data so far, 
and you now read "the tree leaves fell on the root of the tree and the". You 
have little training data, but you can predict that the next word is probably 
related to tree, leaves, etc., so leaf, tree, branch, and twig all get boosted 
by related words from recently read words. And it's really powerful; I've done 
tests in this area as well. The Hutter Prize has a slew of variants of what I 
presented, like looking at the last 1,000 letters to boost the likeliest next 
letter. That's good, but not as accurate or flexible as word prediction using 
related words instead of exact letters! Big difference.

I look forward to your thoughts. I hope I provided some insight into my design 
and tests, and that you can help me if there is something I'm missing, as my 
design does do a lot in a single architecture. I don't see why it's a good idea 
to study it as a stack of black boxes without fully understanding how it makes 
decisions that improve Evaluation (prediction). While my design may be 
inefficient, it may be the natural way it all fits together using the same 
nodes.

To learn more, I have a video and a different but similar run through my design 
in this file (and how my code works exactly): 
https://workupload.com/file/Y4XhZPYHzqy
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T30bb2d96ab6bb993-Ma148f477a002b2811c9ada32