[agi] (video) How can I improve my AGI architecture?

immortal . discoveries Fri, 08 May 2020 13:13:08 -0700

In the video below, I walk you through a fair amount of the AGI architecture
I've been working on for 5 years. I'm looking to find out whether I am missing
something or am on to something. The design is meant to be very simple and to
explain a lot of how thinking occurs. Below is a description of how my text
predictor code works (100MB compresses to approx. 21.8MB). Please read it twice
before jumping into the video; you will learn some fundamental things all good
text predictors do using frequency. Frequency is also used for discovering that
words like cat and dog are related (cat=dog). Note that this compression is for
evaluation and is different from compressing a network to learn a better model.
I should have also mentioned in the video that Summarization, Translation, and
Elaboration would be controlled by how much energy is allowed: when you
Summarize, you say only the important features, not the frequent, unrelated, or
unloved words.


How my text predictor/compressor works (100MB -> 21.8MB):
My algorithm steps a 17-letter-long window along the input file 1 letter
(byte) at a time, updating a tree as it sees new data. The tree's branches are
17 nodes long because each window is added to the tree (after the search
process described next finishes), and the count of every node it passes
through is incremented. For each step the window takes, the algorithm runs 17
searches of the tree, each one letter longer than the last. The child leaves
(the final letter of a searched branch) are the predictions, with the counts
seen so far in the file. Layer 1 nodes are children too and need no match. The
tree thus stores the frequency of every string of 1, 2, 3, ..., 17 letters seen
so far. The children are what allow you to predict/compress the next letter
accurately.

These 17 sets of predictions must
be mixed because, while the longest set is the most accurate, it has the fewest
statistics, sometimes only 2 counts. We start with the longest match found,
e.g. a 14-letter match in the tree. The 14th set of predictions may say it has
seen these letters come next: a=44, b=33, f=25, w=7. I sum a set's counts to
get a total (109 in this case), then divide each count by the total to get
fractions that all add up to 1, e.g. 0.404, 0.303, .... Having these predicted
fractions, we still have 13 sets to mix and must take some share away from each
of them.

What I do is check the set's total count against a Wanted Roof, e.g. 109 vs.
300 (maybe we don't even need to mix in lower sets if we have enough stats). In
this case the set's predictions are each scaled down to about 1/3rd, and we
still desire about 66% more stats. For the next set, if we have say 200 vs.
300, I take away 2/3rds of the 66%, meaning we still desire 22%,
not 66% - 2/3rds = 0%! I take away the fraction obtained OF the fraction still
desired. A little of the lower sets therefore always leaks in, which is better
because we can never be sure even if we surpass the Roof by a lot. Besides, it
gave better results.

The Roof itself is decided by how many predicted symbols are in the set (the
total unique symbols being predicted), so if I have 2 then the Roof may be 8
counts wanted. While the Roof is based on how many different symbols are seen
in the set, it also differs slightly depending on which set we are on: with 4
letters in set #14 the Roof is e.g. 33, but in set #5 it is e.g. 26. The Roof's
size also modifies the bend of a curve. This Activation Function
curve/threshold gives small/large total counts in a set an even smaller/larger
total (it isn't used in the Arithmetic Coding; it's only used for deciding what
share this set gets in our mixer). It is meant to be an exponential activation.
Finally, a global weight is given to each set, e.g. the 14th set is always
given 0.7 times the weight it was going to get.

I hardcoded the numbers for now, but the code isn't grossly large. If they were
adaptive and based on the data, the compression would be even better. I just
noticed that I do exit the mixing before reaching the lower sets if the Roof is
ever surpassed; I'll have to test if this is
useful.

The Arithmetic Coder takes the combined sets, i.e. the prediction fractions
are combined as a, b, c + a, b, c + a, b, c ... = a, b, c (normalized so all
the predictions add up to 1, i.e. a + b + c = 1). The AC then takes a high and
low bound of 1 and 0, takes the middle between the high and low, and starts
subtracting each prediction's share of the set until it matches the final
letter in the window (the process is the same whether compressing or
decompressing). Say we stop once we reach b in our set, e.g. a, *b*, c: we are
now in the float range of e.g. 0.45-0.22. We take the middle again (0.335) and
start subtracting again once the window on the file takes another step. The
encoded decimal keeps getting more precise, storing the whole file. To work in
a 16 byte float we need to carry away locked digits, meaning that if the low
and high are now 0.457594-0.458988, we store '45' and are left with
0.7594-0.8988, and we then keep taking the middle of these 2 to make the
decimal more precise. This long decimal is then stored as a binary number,
i.e. digits such as 6456453634636 are written out as bits.

I didn't implement
having the window store its last few letters as branches, i.e. the 17-letter
window adds itself to the tree, but before predicting the next letter it could
also add the last 16, 15, 14, etc. letters as shorter branches, which would
help just a 'bit' more. I also didn't implement removing from lower sets the
counts that just come from the higher set, because it hurt compression: if
there are 9 counts total in set 3 and 99 total in set 2, then 9 of the counts
in set 2 are the same observations and 'should' not help us reach the Roof.
I'll look into it more.

Lastly, escape letters: the first set we mix is a dummy set that has a super
small weight and contains every possible letter, in case we need to
encode/decode a letter that hasn't been seen yet in the file; hence it requires
a small amount of room in the AC high-low bounds. I also hardcoded each
probability in this dummy set; common letters get more weight.
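
The counting and mixing steps above can be sketched roughly as follows. This is
an illustration under assumptions, not the code from the post: the names
(`new_counts`, `update_counts`, `mix`, `roof_per_symbol`, `cap`,
`escape_weight`) and every constant are made up for the example, the window is
shortened to 4 letters, a flat dictionary of contexts stands in for the tree,
the Roof is simply 4 counts per unique symbol, the escape set is uniform, and
the activation curve and per-set global weights are left out.

```python
from collections import defaultdict

MAX_ORDER = 4  # assumption: the post uses a 17-letter window; 4 keeps the demo small


def new_counts():
    # counts[context][next_letter] -> times next_letter followed context
    return defaultdict(lambda: defaultdict(int))


def update_counts(counts, history, symbol, max_order=MAX_ORDER):
    """Count `symbol` after every suffix of `history` of length 0..max_order-1."""
    for n in range(min(len(history), max_order - 1) + 1):
        context = history[len(history) - n:]  # n == 0 gives the empty context
        counts[context][symbol] += 1


def mix(counts, history, alphabet, max_order=MAX_ORDER,
        roof_per_symbol=4, cap=0.95, escape_weight=0.001):
    """Blend the per-length prediction sets, longest match first.

    `alphabet` must contain every symbol that has been counted.
    """
    # Dummy escape set: every letter gets a tiny share (uniform here).
    mixed = {s: escape_weight / len(alphabet) for s in alphabet}
    desired = 1.0 - escape_weight  # share of the mix still wanted
    for n in range(min(len(history), max_order - 1), -1, -1):
        table = counts.get(history[len(history) - n:])
        if not table:
            continue  # this context was never seen
        total = sum(table.values())
        roof = roof_per_symbol * len(table)  # Roof grows with unique symbols
        # Take a fraction OF what is still desired; the cap lets lower sets leak in.
        share = desired * min(total / roof, cap)
        for s, c in table.items():
            mixed[s] += share * c / total
        desired -= share
    norm = sum(mixed.values())
    return {s: p / norm for s, p in mixed.items()}


text = "abracadabra"
counts = new_counts()
for i, ch in enumerate(text):
    update_counts(counts, text[:i], ch)
probs = mix(counts, text, sorted(set(text)))
print(max(probs, key=probs.get))  # most likely next letter
```

With a Roof of 300, totals of 109 and then 200 would leave roughly 64% and then
21% of the mix still desired, close to the 66% and 22% quoted above.
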

Compression/decompression takes 2 hours and 16 minutes for 10MB, but Python is
slow. RAM use is fairly big because I didn't implement pruning. My algorithm
handles incomplete/noisy information (uncertainty) unsupervised and Online,
hence the mixing of window models. A better net, or better net compression
and/or file compression and insight extraction (not decompression of the
FILE!), faster code, and less RAM (Working Memory) used all lead us closer to
AGI, and smaller code does too (a bit).
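
The interval narrowing and digit carrying described above can be sketched as
follows. This is a simplified illustration, not the code from the post: it uses
exact `Fraction` arithmetic instead of floats, a fixed probability table
instead of freshly mixed predictions at every step, and hypothetical helper
names (`slices`, `encode`, `decode`, `carry_locked_digits`).

```python
from fractions import Fraction


def slices(probs):
    """Give each symbol its own [low, high) slice of [0, 1)."""
    out, low = {}, Fraction(0)
    for sym, p in probs.items():
        out[sym] = (low, low + p)
        low += p
    return out


def encode(message, probs):
    """Narrow [low, high) once per symbol; return a number inside the final range."""
    low, high = Fraction(0), Fraction(1)
    table = slices(probs)
    for sym in message:
        width = high - low
        s_low, s_high = table[sym]
        low, high = low + width * s_low, low + width * s_high
    return (low + high) / 2


def decode(code, length, probs):
    """Replay the same narrowing, picking the slice that contains `code`."""
    low, high = Fraction(0), Fraction(1)
    table = slices(probs)
    out = []
    for _ in range(length):
        width = high - low
        for sym, (s_low, s_high) in table.items():
            lo, hi = low + width * s_low, low + width * s_high
            if lo <= code < hi:
                out.append(sym)
                low, high = lo, hi
                break
    return "".join(out)


def carry_locked_digits(low_digits, high_digits):
    """Split off the leading digits that low and high already share."""
    n = 0
    limit = min(len(low_digits), len(high_digits))
    while n < limit and low_digits[n] == high_digits[n]:
        n += 1
    return low_digits[:n], low_digits[n:], high_digits[n:]


probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
code = encode("abcab", probs)
print(decode(code, 5, probs))
print(carry_locked_digits("457594", "458988"))
```

The last call reproduces the example from the text: '45' is stored and the
bounds continue as 7594 and 8988.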

My code is in Python, but for now I'm linking Shelwien's Green in C++, which is
very similar. https://encode.su/threads/541-Simple-bytewise-context-mixing-demo

Video:
https://www.youtube.com/watch?v=-9mGm6175BQ

I think one key difference in ANNs for text is that the network doesn't store
nodes that can be displayed as solid letters and phrases, as mine can. For
example, the lowest-layer nodes a, b, and c may all point to the parent node
'abc', which has e.g. a count of 5 times seen so far, while the 'a' that builds
it has only 3 accesses seen so far. So instead of blended words or phrases like
'cotg' made from cat/dog, you might even get 'cOtG', where some children affect
the node less. I'm unsure yet if that's useful.

>From testing GPT-2 and making my own algorithm last year, I have strong 
>evidence that nodes retain energy and the frequency predictions are helped out 
>by already existing energy sitting in related nodes. As you know, when you 
>hear something, it remains on your mind for quite some time. The last 80 words 
>read are all energized, stored in order but as chunks, and are not on paper 
>anymore but in your brain! They *need* to remain active in your brain. The 
>more activated a similar phrase node - the more activated its prediction 
>parents will be. But word nodes may also leak energy to other similar word 
>nodes as well. The energy sitting around definitely will add to the prediction 
>energies therefore, see? If 'leaf' was activated 40 words ago, and our
>predictor predicts letters from word nodes, the leaf and grass etc. nodes will
>also be pre-activated a bit. These energies eventually fade off your mind
>exponentially.
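
One way to picture the fading energy described in the quote is the sketch
below. Everything in it is an assumption made for illustration: the half-life
of 20 words, the `strength` and `leak` factors, the `related` table, and the
rule of multiplying a prediction by its energy are not taken from the post or
from any measurement.

```python
def energies(recent_words, half_life=20.0):
    """Energy per word, halving for every `half_life` words of age."""
    energy = {}
    for age, word in enumerate(reversed(recent_words)):
        energy[word] = energy.get(word, 0.0) + 0.5 ** (age / half_life)
    return energy


def boost(probs, energy, related, strength=0.5, leak=0.5):
    """Raise predictions whose own or related nodes still hold energy."""
    scores = {}
    for word, p in probs.items():
        leaked = sum(energy.get(r, 0.0) for r in related.get(word, ()))
        own = energy.get(word, 0.0)
        scores[word] = p * (1.0 + strength * (own + leak * leaked))
    total = sum(scores.values())
    return {word: s / total for word, s in scores.items()}


recent = ["leaf"] + ["the"] * 39  # 'leaf' is the 40th most recent word
energy = energies(recent)
primed = boost({"grass": 0.5, "car": 0.5}, energy, {"grass": ["leaf"]})
print(primed)
```

Here 'grass' ends up slightly more likely than 'car' only because the
hypothetical `related` table links it to the still-active 'leaf' node.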

We can see that Facebook's Blender also uses Permanent energies, using a
"Dialog" as they call it, which makes it *always* talk/ask as if it has an
agenda for being a communist. These nodes are reward-hardcoded from birth and
*should* update other related nodes to create new subgoals for the food node
goal. The food node goal itself will never change, since it is more
reward-hardcoded; you know you can't change the food node, as it's critical for
survival.
https://www.youtube.com/watch?v=wTIPGoHLw_8

My main points here are that frequency in predictions runs my code, that
recognizing similar phrases will increase counts (found using frequency; the
closest in delay time affect it most), that using energy to boost related
predictions helps a ton, and that permanent reward does too. See how all that
and more work in the hierarchies? What more can we do!? Can you add anything!?
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/Tf06e133ecd7df7c9-M619dcadfc614fcdbb833359e
Delivery options: https://agi.topicbox.com/groups/agi/subscription
