If you do what GPT-2 does, it can give you plans, but the human must act them out. The text is just data; if you add images, that's just more data going through self-attention. You'd end up with a visual+text GPT-3 that simply has a richer context.
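To make the point concrete, here is a minimal sketch of that idea: a toy single-head self-attention that runs over one mixed sequence where text-token and image-patch embeddings sit side by side. All the embeddings and dimensions here are made up for illustration; attention itself never asks which modality a vector came from.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """Toy single-head self-attention over one sequence of embeddings.

    `tokens` can hold text-token and image-patch embeddings side by side;
    the attention computation treats them all as just data.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # scaled dot-product scores of this query against every token
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        # weighted sum of all token embeddings (values = keys here)
        out.append([sum(w * k[j] for w, k in zip(weights, tokens))
                    for j in range(d)])
    return out

# hypothetical embeddings: two text tokens followed by two image patches
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_patches = [[0.5, 0.5], [1.0, 1.0]]
mixed = self_attention(text_tokens + image_patches)
```

Adding images this way just lengthens the sequence; the model's mechanics are unchanged, which is why the result is "simply more context."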
Did you see MuseNet? It predicts the next note for each instrument, using the notes from all instruments so far.

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: https://agi.topicbox.com/groups/agi/T4f01e8a4b34d0e2a-Mfb63426fd295d9353b032646
Delivery options: https://agi.topicbox.com/groups/agi/subscription
