With image compression, and building an understanding of the world first and 
then adding language on top: humans don't usually talk about the microscopic 
level, or the walls of the bread bag, and so on - we just say "bread." In 
vision we learn to segment objects, but images are noisy, so there is never an 
exact match to context the way there is in PPM text predictors. What we do 
instead is recognize different breads and group them into one representation 
for "bread," and we can store type images and specific episodic images as 
well. Then, when we see any of these breads, it IS an exact match at the 
capsule level, and we can get an entailed prediction of a hand approaching the 
bread. It's like capsule networks, balls linking to balls: 
bread > hand > mouth > chews > swallows > leaves. So image compression 
evaluation would work: the predicted probability of a TYPE of hand following 
the previous image (or part of the same image) would be high whether we see a 
bread bag or a coffee mug. You have to let ALL breads predict that a HAND 
entails - that's the key. Awesome!

Now, as I said, AGI doesn't at first need a body; text will do. Building a 
visual understanding from movies is simply grouping breads into one "bread" 
word as a capsule (the capsule 'bread' is a visual, non-text word). What you 
really need is text, where humans have already done that grouping for you! 
"Bread" - there's your text capsule. Text is just that! And it gives you the 
power to talk about types of bread, or atoms. Building an understanding of the 
world first is language; all is language... words are really part of vision, 
the same thing. As for ignoring low-level talk and speaking about high-level 
objects: yes, in text and vision we see and do that because it's efficient, 
and when we need to we can go low-level to deal with a few things, otherwise 
we stay high-level. So what we learned here is that text is basically all you 
need, and we learned how image compression could work.
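A toy sketch of the capsule idea above, under made-up assumptions: the feature 
vectors, the `CAPSULES` store, and the `ENTAILS` table are all hypothetical 
illustrations, not a real model. Noisy exemplars are grouped under one 
prototype (their mean), any new image that lands nearest that prototype is an 
"exact match" at the capsule level, and the capsule then predicts what entails 
next - so every bread predicts a hand.

```python
import math

# Hypothetical capsule store: each capsule groups noisy exemplar vectors.
# Different breads all collapse into the single "bread" capsule.
CAPSULES = {
    "bread": [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15]],  # toy feature vectors
    "mug":   [[0.1, 0.9], [0.2, 0.8]],
}

# Capsule-level transitions: ALL breads (and mugs) predict that a HAND entails.
ENTAILS = {"bread": "hand", "mug": "hand", "hand": "mouth", "mouth": "chews"}

def prototype(exemplars):
    """Mean of the exemplars -- the capsule's 'type image'."""
    n = len(exemplars)
    return [sum(v[i] for v in exemplars) / n for i in range(len(exemplars[0]))]

def cosine(a, b):
    """Similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def recognize(image_vec):
    """A noisy image is never a pixel-exact match, but the nearest
    prototype gives an exact match at the capsule (word) level."""
    return max(CAPSULES, key=lambda c: cosine(image_vec, prototype(CAPSULES[c])))

def predict_next(image_vec):
    """Recognize the capsule, then follow its entailment link."""
    return ENTAILS.get(recognize(image_vec))

# A never-seen-before bread variant still lands in the "bread" capsule,
# and the capsule predicts a hand approaching next.
print(recognize([0.7, 0.3]))     # -> bread
print(predict_next([0.7, 0.3]))  # -> hand
```

The design choice here is the one argued for in the text: prediction is 
attached to the capsule, not to individual exemplars, so generalization to new 
breads comes for free.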

Vision still seems odd to me, by the way. You can imagine a large metal chess 
piece sitting on an underground lava floor with a crimson red glow and a 
yellow strap laid on top, complete with a reflection of it. How? And how do we 
recognize all of that if we see it in real life? It seems these capsules work 
across different lighting, angles, rotations, locations, motion, distortion, 
colors, sizes, and similar-looking objects. You can recognize the chess piece, 
the dirt on it, the lava floor, the whole scene... but what about image 
generation? You add a large chess piece where you want on the lava floor and 
lay a yellow strap over it, but what accounts for the occlusion? The 
reflections? Etc.? One object, being near another, is transforming it - 
blending their contexts. I know it gets these abilities from seeing lots of 
movies, but exactly how it works still boggles me. Even if human brains don't 
actually run such physics sims, we can on computers, using 2D or 3D 
pixels/voxels. The 3D case would be rather easy to do; for 2D, I'm stuck.

And how useful is it anyway? Say you have a glass of red wine, a bug, a cloth 
that water wicks up as it gets wet, and a table for the water to fall off of, 
and you run the 2D sim to get more detail than text can provide. How does 2D 
image generation know when water should fall off the table, hide behind 
objects, share a reflection with the red wine, and so on? As for usefulness: 
it gives far more detail than text, but maybe that doesn't matter - text can 
talk about all of it, and physics prediction in text is just text prediction. 
Text can even say the water reflected the wine's red color, being near it, or 
fell behind something and disappeared but was waiting there hidden, working on 
a make-do tool after turning into a caveman - oh yes, I can daydream animals 
morphing like GANs do. That morphing can itself be objectified as the word 
"morph," and the actual sequence 
cat > tiger > bear > man > moleRat > hamster > slug > worm > pencil > stick 
can be said as words. After all, text is object-ifying vision as capsules - 
the same thing, just with less context than vision might show. As for words 
changing each other: if the sentence is "red wine fell on a hamster," it 
blends context, maybe, to get "wine splashed and made a hamster wet." That 
mirrors what our sight does; our text data reflects our visual capsules and 
their transformations. The hamster could be said to have become crippled; now 
the crippled hamster is walking around, and I could call it "the paraplegic" 
instead.
------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T2a0cd9d392f9ff94-M3f2e822e256fb12c8edc3a80
Delivery options: https://agi.topicbox.com/groups/agi/subscription
