On image compression, and on building an understanding of the world first and then adding language on top: humans rarely talk at the microscopic level, or about the walls of a bread bag etc., we just say "bread". We learn to segment objects in vision. Vision has noise in its images, so there's never an exact match to context like in PPM text predictors; what we do instead is recognize different breads and group them under one representation for "bread", while still storing different type-images and specific episodic images as well. Then, when we see any of these breads, it IS an exact match, woohoo! And from that we can get an entailing prediction of a hand approaching the bread. It's like capsule networks, balls linking to balls: bread > hand > mouth > chews > swallows > leaves. So image compression evaluation would work: the prediction for a TYPE of hand following the previous image (or part of the same image) would be high whether we see a bread bag or a coffee mug. You have to let ALL breads predict that HAND entails, that's the key. Awesome!

Now, as I said, AGI doesn't at first need a body; text will do. Building a visual understanding from movies is really just grouping breads into one "bread" word as a capsule (the capsule "bread" is a visual word, non-text). What you really need is text, where humans have already groupified it for you! "Bread", there's your text capsule. Text is just that! And it gives you the power to talk about types of bread, or atoms. Building an understanding of the world first IS language; it's all language... words are part of vision really, the same thing. As for ignoring low-level talk and speaking about high-level objects: yes, in both text and vision we do that, and it's efficient; when we need to, we can go low-level to deal with a few things, otherwise we stay high-level. So what we learned here is that text is all you need, basically, and we learned how to do image compression.
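The grouping-plus-entailment idea above can be sketched in a few lines. This is only my toy interpretation, not a working vision system: all the feature vectors, capsule names, and numbers below are invented for illustration. The point it demonstrates is the one in the text: a noisy input never matches any stored image exactly, but once it is snapped to its nearest capsule ("bread"), the match at the capsule level IS exact, and capsule-to-capsule transition counts then give the entailing prediction (bread > hand > ...), shared by ALL breads.

```python
# Toy sketch of capsule grouping + entailment prediction.
# Assumption: "images" are short feature vectors; in reality they'd be
# learned embeddings, not hand-picked numbers like these.
from collections import defaultdict

def dist(a, b):
    """Squared distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

class CapsuleMemory:
    def __init__(self):
        # capsule name -> stored exemplar vectors (the episodic images)
        self.exemplars = defaultdict(list)
        # capsule -> next capsule -> count (the "entailing" links)
        self.transitions = defaultdict(lambda: defaultdict(int))

    def store(self, name, features):
        """Keep a specific image, grouped under its capsule."""
        self.exemplars[name].append(features)

    def recognize(self, features):
        """Snap a noisy input to the capsule of its nearest exemplar."""
        candidates = ((name, dist(features, ex))
                      for name, exs in self.exemplars.items()
                      for ex in exs)
        return min(candidates, key=lambda t: t[1])[0]

    def observe_sequence(self, capsule_names):
        """Learn which capsule tends to follow which."""
        for a, b in zip(capsule_names, capsule_names[1:]):
            self.transitions[a][b] += 1

    def predict_next(self, name):
        nxt = self.transitions[name]
        return max(nxt, key=nxt.get) if nxt else None

mem = CapsuleMemory()
# Two different-looking breads stored under ONE capsule:
mem.store("bread", [1.0, 0.9, 0.1])
mem.store("bread", [0.8, 1.1, 0.0])
mem.store("hand",  [0.0, 0.2, 1.0])
mem.observe_sequence(["bread", "hand", "mouth", "chews", "swallows", "leaves"])

noisy_bread = [0.95, 1.0, 0.05]       # never an exact pixel-level match...
capsule = mem.recognize(noisy_bread)  # ...but an exact capsule-level match
print(capsule, "->", mem.predict_next(capsule))  # prints: bread -> hand
```

Because the transition table is keyed on the capsule, not the exemplar, every stored bread automatically predicts HAND, which is exactly the "let ALL breads think HAND entails" property.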
Vision still seems odd to me, btw. I mean, you can imagine a large metal chess piece sitting on an underground lava floor with a crimson red glow, and a yellow strap laid on top with a reflection of it. How? And how would we recognize it all if we saw it in real life? It seems these capsules work across different lighting, angles, rotations, locations, motion, distortion, colors, sizes, and similar-looking objects. You can recognize the chess piece, the dirt on it, the lava floor, the whole scene... but what about image generation? You add a large chess piece where you want on the lava floor and lay a yellow strap over top of it, but what accounts for the hiding of it? The reflections? Etc.? One object, being near the other, is transforming it!? Blending their contexts... I know it gets these abilities from seeing lots of movies, but the exact way it works still boggles me. Even if human brains don't actually generate such physics sims, we can on computers, using 2D or 3D vision pixels/voxels. The 3D case would be rather easy to do just this way. But for 2D, I'm stuck. And how useful is it anyway? Say you've got a glass of red wine, a bug, a cloth that gets wet as water travels up it, and a table for water to fall off of, and you run the 2D sim to get more detail than text can provide: how does 2D image generation know when water should fall off the table, hide behind objects, share a reflection with the red wine, etc.? As for the usefulness of this, it gives far more detail than text, but then again maybe not; text can talk about it, and physics prediction in text is what text prediction is. Text can even say the water reflected the wine's red color, being near it, or fell behind it and disappeared but was waiting there hidden, working on a make-do tool after turning into a caveman - oh ya, I can daydream animals morphing like GANs do.
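One part of the puzzle above, "what accounts for the hiding of it?", has at least a classic answer in graphics: paint objects back-to-front onto the pixel grid, so nearer paint simply overwrites farther paint (the painter's algorithm). Reflections and context blending are much harder, but occlusion falls out for free. This is a generic graphics sketch, not the post's proposed mechanism; the toy rectangle "scene" is invented for illustration.

```python
# Minimal painter's-algorithm sketch: occlusion via back-to-front painting.
# Objects are toy axis-aligned rectangles: (depth, x, y, w, h, color_char).

def render(width, height, objects):
    """Paint objects far-to-near; nearer paint covers (hides) farther paint."""
    canvas = [["." for _ in range(width)] for _ in range(height)]
    for depth, x, y, w, h, color in sorted(objects, key=lambda o: o[0], reverse=True):
        for row in range(max(y, 0), min(y + h, height)):
            for col in range(max(x, 0), min(x + w, width)):
                canvas[row][col] = color
    return ["".join(row) for row in canvas]

scene = [
    (2, 1, 1, 5, 3, "C"),  # chess piece, farther from the camera
    (1, 3, 2, 5, 1, "S"),  # yellow strap, nearer: hides part of the piece
]
for line in render(10, 5, scene):
    print(line)
# Middle row comes out ".CCSSSSS..": the strap hides the middle of the piece.
```

The draw order, not any per-object rule, is what decides what disappears behind what, which is one reasonable guess at how a learned 2D generator could handle hiding implicitly.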
That morphing can be objectified as a word, "morph", and the actual segmentation of cat>tiger>bear>man>moleRat>hamster>slug>worm>pencil>stick can be expressed as words too. After all, text is object-ifying vision as capsules, the same thing, just with less context than vision would show, maybe. As for words changing each other: if the sentence is "red wine fell on a hamster", it blends context, maybe, to get "wine splashed and made a hamster wet". That mirrors our sight: our text data reflects our visual capsules and their transformations. The hamster could be said to become crippled; now the crippled hamster is walking around, and I could call it "the paraplegic" instead.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T2a0cd9d392f9ff94-M3f2e822e256fb12c8edc3a80
