AI is about, and needs, patterns. Otherwise you fall back on brute force whenever you want to invent a solution (cars, TVs, pills, etc.). So we NEED patterns: prior experiences.
Text and image data both have patterns. For example, you can let the machine discover things from text and tell you: "I love food, I have a tongue, I smell; the hamster may steal your food, because hamsters also love food, have tongues, and smell." Text can tell you which molecule structures are similar to each other (semantics), or what they "usually" evolve into, e.g. dog>bark occurs more often than dog>sleep (syntactics). So what kind of patterns can images contain that text can't? That cats are similar to dogs? That cats lick their paws? That a cat has ticks on the upper left portion of its abdomen? Text can do all that. Maybe vision is the same but only gives you better accuracy? Keep reading.

Matt will know the following better than the rest of you, I bet. In the Hutter Prize contest you can see many algorithms and their histories: you add or improve mechanisms (or data) and voilà, the prediction becomes better. For the Hutter Prize the dataset size is fixed. Because we know adding data is one way to improve prediction accuracy, the goal is to make contestants get better at finding the patterns in the fixed dataset. Some dude may add a new mechanism, then next year improve it, or add another new mechanism. But all of it, even just adding more data, only improves prediction accuracy.

My point is: be it text or images, you can add more data, add new, never-before-seen pattern-finding mechanisms, or improve the ones you have found, and all you get is extra prediction accuracy. We already *saw that* in text. And in images. Both are grounded. I'm not saying images don't have more data. They could. Apparently much of "image data" is useless or noise. Not sure.

Now, you may be thinking: we know both text and images have a prediction accuracy thing going on, which can tell you new discoveries and help you. But does the prediction accuracy in text "transfer" to images? It must. As said, text can inform you that bees may be dangerous, even though you only know that ants are!
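Here is a rough sketch of the Hutter Prize point above: on a fixed dataset a better pattern-finding mechanism predicts better, and with a fixed mechanism more data predicts better. It is only a toy, not any contestant's algorithm. The corpus, the function names, and the add-one smoothing are all my assumptions. "Prediction accuracy" is measured the way a compressor would measure it, as bits needed per symbol (lower is better).

```python
import math
from collections import defaultdict

def code_length_bits(seq, order, alphabet_size):
    """Bits needed to encode seq with an adaptive order-`order` model.

    Each symbol is predicted from counts of what followed the same
    context earlier in the sequence (add-one smoothing), and only then
    are the counts updated, as an online compressor would do.
    """
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    bits = 0.0
    for i, sym in enumerate(seq):
        ctx = tuple(seq[max(0, i - order):i])  # the last `order` symbols
        p = (counts[ctx][sym] + 1) / (totals[ctx] + alphabet_size)
        bits += -math.log2(p)
        counts[ctx][sym] += 1
        totals[ctx] += 1
    return bits

def bits_per_symbol(seq, order):
    return code_length_bits(seq, order, len(set(seq))) / len(seq)

# Hypothetical toy corpus; any symbol sequence would do.
BASE = "the dog barks. the dog eats. the cat sleeps. "
small, large = BASE * 4, BASE * 40

for name, data in (("small", small), ("large", large)):
    for order in (0, 2):  # order 0 = no context, order 2 = two previous symbols
        print(name, "order", order, round(bits_per_symbol(data, order), 3))
```

Nothing in the predictor knows the symbols are letters. It only counts what followed what, so the same function accepts a list of pixel values or frame tokens just as well.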
Image data would tell you just that too. And this also means you could use the same "algorithm" for text on images. OK, but what if (assuming we had found all the text pattern-finding mechanisms, improved them to the max, and fed in all our text data) text is missing key areas of images that we ignored or got wrong, e.g. fine details of objects? Call that "Lack". Or what if vision has much more data/patterns, or, because of its video structure, its prediction accuracy rises faster? Call that "structure". Well, for issue 1 we can feed the algorithm images instead of text, but issue 2 may require some adjustment of the pattern-finding mechanisms to cope with video structure, image structure, and persistence structure (video doesn't flash dog, ate, food; in many cases it shows the dog at all times).

So in conclusion: a text AGI can be fed image data and get all its benefits, BUT it is possible (well, maybe) that our text algorithm, now run on video, would be somewhat less accurate than a native video AGI, and that video may even require a slightly different set of pattern-finding mechanisms. So the question is not whether text = vision; many of the same predictions exist in both, and for the same reasons. The question is how pattern finding over video structure works, or whether it exists at all (that is, whether it is more powerful at finding patterns).

One way to think about that question (not just how the structure works, but why it's more powerful, or whether it exists) is: how can video hold more "patterns/data"? If you have a video of walking through a home with cats etc., or of someone explaining the cure for cancer (yes, visionGPT-2), what do we have here? The cat persists in each frame, well, at least for a few, and is accompanied by objects in the same frame space. A context (a frame) that is part of a context (the video)? So we have a sequence cat>ran>slowed>sat>slept, and at each frame there are accompanying sequences, like in music.
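A toy sketch of the "structure" issue, under loud assumptions: the "video" is a made-up 1-D strip where a cat 'C' drifts slowly and a box 'B' never moves, and all names and sizes are hypothetical. The predictor is the same count-based one in both runs. The only adjustment is which context it looks at: the previous symbol in reading order (text-style), or the same cell in the previous frame (persistence).

```python
import math
from collections import defaultdict

WIDTH, FRAMES, HOLD = 8, 200, 5  # toy sizes, chosen arbitrarily

def make_video():
    """Toy 1-D 'video': the cat moves right one cell every HOLD frames,
    the box stays in the last cell, everything else is background '.'."""
    video = []
    for t in range(FRAMES):
        frame = ["."] * WIDTH
        frame[WIDTH - 1] = "B"
        frame[(t // HOLD) % (WIDTH - 1)] = "C"
        video.append(frame)
    return video

def total_bits(video, context_of):
    """Adaptive count-based predictor with add-one smoothing.
    context_of(video, t, x) decides what the predictor conditions on."""
    alphabet = len({sym for frame in video for sym in frame})
    counts = defaultdict(int)
    totals = defaultdict(int)
    bits = 0.0
    for t, frame in enumerate(video):
        for x, sym in enumerate(frame):
            ctx = context_of(video, t, x)
            p = (counts[ctx, sym] + 1) / (totals[ctx] + alphabet)
            bits -= math.log2(p)
            counts[ctx, sym] += 1
            totals[ctx] += 1
    return bits

def raster_context(video, t, x):
    """Text-style: previous symbol when the frames are read as one long string."""
    if x > 0:
        return video[t][x - 1]
    return video[t - 1][-1] if t > 0 else None

def temporal_context(video, t, x):
    """Video-style: the same cell in the previous frame."""
    return video[t - 1][x] if t > 0 else None

video = make_video()
raster_bits = total_bits(video, raster_context)
temporal_bits = total_bits(video, temporal_context)
print("bits per frame, raster context:  ", round(raster_bits / FRAMES, 2))
print("bits per frame, temporal context:", round(temporal_bits / FRAMES, 2))
```

On this toy the temporal context needs fewer bits, because most cells simply repeat from frame to frame. Whether real video behaves the same way is exactly the open question; the sketch only shows that the change can be a small adjustment to the context and not a different algorithm.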
As we watch a video, our eye is actually paying attention mostly to a certain area, be it a word on your screen or the whole page of writing. (You can't read like that, by the way; you must read word by word. But seeing the book's page is a "word" too, yes: "page".) We still see the other sequences a bit as well, as said. And it's not just extra context or words; it's connected right in time. Also, when we watch the video we may see 3D or reflections on the cat, and this can tell us the cat is shiny or is probably near a flashlight, even from a single frame. Hmm, I think seeing "shiny" or "flashlight" or "behind an object" is another "word", or a view of something related. So, conclusion? I think they are very similar. OpenAI's iGPT knew to put reflections under the birds: https://openai.com/blog/image-gpt/

---------------------------

Another thing I just wrote:

The thing about navigating, chair manufacturing, or golfing is that these are narrow things to become good at; text or vision is general and can describe any story, anything.

We model the universe and have the goal of survival. We use all knowledge... however, we specialize in a domain such as stem cells or AGI and collect data from those "experiments"; only rarely do we explore, e.g. get bored and read the news during an exam, lol.

Ultimately, text and vision are general and can describe anything, and all humans ultimately/cumulatively focus on some specific domain (on average) in our physics, e.g. the computing industry (DNA, brains, and our devices; computing/mutating data is a "thing"). So in some sense all humans talk or work for the computing industry, be it our genes or the new "Microsoft". Our "survival" uses all info, but some more than others....
It is computing data, and it cares about... computing data :)

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T0c5c4825f9d718f4-M9daf91c4f2be13e6bae151e8
