We may also see a movie-like video in the background, perhaps an animated
one showing a cat, or, in an image, some people on a plane, flying with a
cat or with a camera on their lap. If we want to know more about the
context of a video then we may be able to see that video. But we would like
more context to be seen. Then maybe this information can be retrieved in
time, or in a more general sense could be retrieved in a more detailed way
that is more obvious to humans and perhaps not necessarily useful for
computers as well. (In addition, the two most recent examples are a few
sentences of our own and a few examples from a group of people.) We see a
certain similarity between the context of a movie and this context
(possibly a shared or shared view) that we think we need more context to
retrieve. This would also be useful for computers, because for instance if
we can identify if one person is on the plane, or if we can see the movie
with an iPhone, then we can know if the one person on the plane is on that
plane (possibly a shared view). We also see a number of ways that a view of
a video is related to what the computer is seeing in the background or
perhaps not in time (perhaps a similar or overlapping view of the movie or
a similar or overlapping view of a movie).



We may find this information useful for computers in several ways. For
instance we may find it helpful for machines that are trained on a
continuous video feed that is not interrupted by a single frame being
recorded, where an error can still be discovered by a neural network, or
the information may be used for other tasks (such as text summarization),
or it may be used for other tasks that humans do not typically use to find
information about the video in the background. In other cases, for instance
we may find it helpful for humans to keep track of how far from the target
window is to the next frame of video that we use for text summarization (in
contrast to how we can find important information about the target window).
We may also find it helpful for systems trained on multi-view video, where
the size of the target window (the size of the movie) or of the video
frames (the size of the movie or video) may not be enough to find out what
is happening in the video, such as when the frames are shorter or when the
images are bigger (when the images are bigger or smaller than the movie or
the video).

In the present work we show how the word and phrase embeddings learn to
combine with the embeddings to form a sequence of word embeddings. We also
show how the word embeddings learn to combine with the embeddings to form a
sequence of phrase embeddings, and finally, we compare these sequences of
embeddings to a sequence of word embeddings. We report on the development
and evaluation of the neural machine translation (NMT) system in Figure 1.


--Generated with Arxiv-NLP primed with your email. ;)

On Thu, Jul 16, 2020 at 5:45 AM <[email protected]> wrote:

> AI is about, and needs, patterns. -- Otherwise you use Brute Force
> whenever you want to invent a solution ex. cars, TVs, pills, etc. So, we
> NEED patterns. Prior experiences.
>
> Text and image data both have patterns. You can for example let the
> machine discover (using text) and inform you "i love food, i have a tongue,
> i smell, the hamster may steal your food, because hamsters also love food,
> have tongues, and smell". Text can tell you what molecule structures are
> similar to each other (semantics) or what they evolve into "usually" ex.
> dog>bark more often than dog>sleep (syntactics).
>
> So, what kind of patterns can images contain that text can't? What? That
> cats are similar to dogs? Or that cats lick paws? Or that cats have ticks
> on the upper left portion of its abdomen? Text can do all that. Maybe
> vision is the same but only gives you better accuracy? Keep reading.
>
> Matt will know the following more than you other guys I bet. You can see
> in the Hutter Prize contest many algorithms - and their histories, you add
> or improve mechanisms (or, data) and voila, the prediction becomes better.
> For the Hutter Prize, the dataset size is fixed, because we know adding
> data is 1 way to improve prediction accuracy, the goal is to make the
> contestants find the patterns in the fixed dataset better. Some dude may
> add a new mechanism. Then next year, improve it, or add another New
> mechanism. But, they all, even just adding more dataset, just improve
> prediction accuracy more. My point is, be it text or image, you can add
> more data, or add new and never before seen pattern finding mechanisms, or
> improve the ones you found, and all you get is extra accuracy in
> prediction. We already *saw that* in text. And images. Both, are grounded.
> I'm not saying images don't have more data. It could. Apparently much of
> "image data" is useless or noise. Not sure.
>
> Now, you may be thinking, we know both text and images have a prediction
> accuracy thing going on, which can tell you new discoveries and help you. -
> But does the prediction accuracy in text "transfer" to images? It must. As
> said, text can inform you bees may be dangerous, even though you only know
> ants are! Image data would tell you just that too. And this means also you
> could use the same "algorithm" for text on images too. Ok, but what if
> (assuming we found all text pattern finding mechanisms, improved them max,
> and fed all our text data) text is missing key areas of images we ignored/
> got wrong, ex. fine details of objects? Call that "Lack". Or what if vision
> has much more data/patterns or because of its video structure and hence
> prediction accuracy rises faster? Call that "structure". Well, for issue1
> we can feed the algorithm images instead of text, but for issue2, it may
> require some pattern finding mechanism adjustment to cope with video
> structure and the image structure and persistence structure (video doesn't
> flash dog, ate, food, it shows dog at all times in many cases).
>
> So in conclusion, text AGI can be fed image data and get all its benefits
> BUT it is possible (well, maybe) that our now "video AGI text-algorithm"
> versus video AGI could be somewhat less accurate than video and may even
> require a slightly different set of pattern finding mechanisms. So
> the only question then is not if text=vision, many of the same predictions
> exist in both and for the same reasons, our question is how does video
> structure pattern finding work or does it exist (if it's more powerful at
> finding patterns)?
>
> One way to think about that question is (not just how the structure works,
> but why it's more powerful, or if it exists): how can video hold more
> "patterns/data"? If you have a video of walking through a home with cats
> etc, or explaining the cure to cancer (yes, visionGPT-2) what do we have
> here? The cat persists in each frame, well, at least a few, and is
> accompanied by objects in the same frame space. A context (a frame), that
> is part of a context (video)? So we have a sequence
> cat>ran>slowed>sat>slept and at each frame there is accompanied sequences,
> like music. As we watch a video, our eye is actually paying attention to
> mostly a certain area (be it a word on your screen or the whole page of
> writing (you can't read like that btw, you must read word by word, but
> seeing the book's page is a "word" yes; page)). We are still seeing the
> other sequences a bit too, as said. And it's not just extra
> context or words, it's connected right in time. Also, when we see the
> video, we may see 3D or reflections on the cat, this can let us know the
> cat is shiny or probably near a flashlight, from just a single frame too,
> hmm, I think seeing "shiny" or "flashlight" or "behind an object" is
> another "word" or view of something related.
>
> So conclusion? I think they are very similar. OpenAI's IGPT knew to put
> reflections under the birds. https://openai.com/blog/image-gpt/
>
> ---------------------------
> ---------------------------
>
> another thing i just wrote:
>
> the thing about navigating or chair manufacturing or golfing is these are
> narrow things to become good at, text or vision is general and can describe
> any story / anything.
> / we model the universe, and have the goal survival
> we use all knowledge...however we specialize in a domain such as stem
> cells or AGI to collect data from those "experiments"
> only rarely we explore, ex. get bored and read the news during exam lol
> ultimately, text and vision is general and can describe anything, and all
> humans ultimately/cumulatively focus on some specific domain (on average) in
> our physics and in ex. computing industry (DNA, brains, and our devices,
> computing/ mutating data is a "thing")
> so in some sense, all humans talk or work for the computing industry, be
> it our genes or the new "microsoft"
> our "survival" uses all info, but some more than others....
> it is computing data, and it cares about...computing data :)
> *Artificial General Intelligence List <https://agi.topicbox.com/latest>*
> / AGI / see discussions <https://agi.topicbox.com/groups/agi> +
> participants <https://agi.topicbox.com/groups/agi/members> + delivery
> options <https://agi.topicbox.com/groups/agi/subscription> Permalink
> <https://agi.topicbox.com/groups/agi/T0c5c4825f9d718f4-M9daf91c4f2be13e6bae151e8>
>


-- 
Daniel Jue
Owner
Cognami LLC

------------------------------------------
Artificial General Intelligence List: AGI
Permalink: 
https://agi.topicbox.com/groups/agi/T0c5c4825f9d718f4-M896cd65ca10a03d92847a985
Delivery options: https://agi.topicbox.com/groups/agi/subscription
