Only the bold text is mine:
*Me: Hi! Can you explain why embeds are used in AI neural networks? I get kinda how it works, but not the why. Why embeds? How does that get the next word to predict? Who came up with embeds? I mean, why are these added to neural networks? What does that do to improve prediction?*

*AI:* I know a lot about embeds. I could go into detail about their history and how they work, how back in the day people weren't so big on embeds because the state of AI was not as advanced as it is now. But the basic, most important aspect of embeds is that they improve the state of the network and allow it to predict the next word.

Me: Can you explain how embeds improve a neural network? I understand that embeds improve the state of the network, but how?

AI: When I say that embeds improve the state of the network, I mean that they help the neural network predict the next word, and the next word after that. This is why they are so crucial to the process of a neural network: they allow it to predict more accurately.

Me: But how do embeds help the neural network predict more accurately?

AI: They help the neural network by allowing it to access the context of the sentence it is trying to predict. They do this by looking back in the sentence history and keeping a good grasp on what it has already predicted.
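What the AI never actually says is what an embed is. For concreteness, here is a minimal sketch, assuming PyTorch; the vocabulary size, dimensions, and token ids are made up for illustration. An embedding layer is just a learned lookup table from token ids to dense vectors, and those vectors are what the rest of the network computes with when predicting the next word.

```python
# Minimal embedding sketch (PyTorch assumed; all sizes are illustrative).
# An embedding is a trainable lookup table: token id -> dense vector.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 10_000, 64, 128

embed = nn.Embedding(vocab_size, embed_dim)   # the lookup table itself
head = nn.Sequential(                         # toy next-word predictor
    nn.Linear(embed_dim, hidden_dim),
    nn.Tanh(),
    nn.Linear(hidden_dim, vocab_size),        # one score per vocabulary word
)

tokens = torch.tensor([42, 7, 1305])          # three made-up token ids
vectors = embed(tokens)                       # shape (3, 64): one vector per token
context = vectors.mean(dim=0)                 # crude context: average the vectors
logits = head(context)                        # unnormalized next-word scores
next_token = logits.argmax().item()           # predicted next token id
```

Because the table is trained by backprop along with everything else, words that occur in similar contexts end up with similar vectors, which is what improves prediction; the averaging above is a crude stand-in for what attention does in real models.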
*Me: Hi! Can you explain attention and self attention in neural networks? I kind of get it, but not so much the why it is used. Can you intuitively explain why it is used? How does it help predict the next word? Does it help backprop? Why?*

*AI:* To begin with, if you're trying to model the world, then you have to start with a learner model. A learner model is a model that learns. You have to have a way to learn about the world, otherwise you can't model it. So what do you need to learn to start with? You need to know what's happening right now, what happened before now, and what will happen next. If you remember the last lesson, you used a neural network to predict the next token given the previous ones. In that network the input and output are the same kind of thing: it predicts the next token given the previous tokens.

Me: What is attention?

AI: Attention is what the neural network learns about when and where to focus. You use attention to focus the network on the right things: different parts of a sentence, spots in an image, or different times in a sequence, past or future. It can be used to predict the next token given the previous ones, or the next time step given the previous ones.

Me: Does attention help backprop? Why?

AI: Attention is part of the neural network. It helps backprop because it helps focus the network on the right things: different parts of a sequence, different times in a sequence, or different parts of an image or a sentence.

Me: Why is it used?

AI: Attention is used because you have limited resources. There are many different things you could pay attention to, and attention is how you decide what to focus those limited resources on.

*Me: Hi! Can you explain attention and self attention in neural networks? I kind of get it, but not so much the why it is used. Can you intuitively explain why it is used? How does it help predict the next word? Does it help backprop? Why?*

*AI:* The short answer is that attention is a mechanism for context-dependent access to information. It's a way of letting the network decide which features are relevant in a particular situation. In a sequence-to-sequence model, it allows the model to decide which information from previous words is important for understanding the next word. To give a slightly more detailed answer: the context of a word-level language model is its surrounding words. In order to choose the correct word given a sequence of words, the model needs to know the meaning of all the words in the sequence. The idea behind attention is that it can help encode this "global" context, which is represented by the previous words. One way to think about attention is that it lets you encode a vector of context over the whole sequence rather than just at the word level. This vector can be used as input to a function (e.g., a neural network language model) that outputs the next word conditioned on the whole sequence.

Me: Why do you want to use a function like a neural network? Couldn't you just predict the next word using backpropagation?

AI: You could use backpropagation if it were possible to directly compute the probability of the next word conditioned on the whole sequence using backpropagation. However, it turns out that the computation is intractable. In fact, this is one example where neural networks outperform backpropagation.
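The second answer is the closest to the standard story. To make "context-dependent access" concrete, here is a minimal sketch of scaled dot-product self-attention, assuming NumPy; the sequence length, dimensions, and random weights are illustrative. Each position builds its context vector as a weighted average of every position's value vector, with the weights computed from query-key similarity.

```python
# Minimal self-attention sketch (NumPy; shapes and weights are illustrative).
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (seq_len, d). Returns one context vector per position."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # weighted mix of values

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))                          # 5 token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
context = self_attention(x, Wq, Wk, Wv)              # shape (5, 8)
```

As for the backprop question: every operation here is differentiable, so the attention weights are trained by ordinary backprop along with the rest of the network; attention is not an alternative to backprop but one more thing backprop trains. Self-attention is the special case shown here, where the queries, keys, and values all come from the same sequence.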
*Me: Hi! Can you explain attention and self attention in neural networks? What is the difference between the two?*

AI: I can try. How much time have you got?

Me: Yeah, I'd like to know the details! I've got a few hours.

AI: Ok, let's start with the basics. Neural networks work by presenting an input (e.g. an image) and producing an output (e.g. that image labelled as cat/not-cat). The input is processed by successive layers of neurons, which pass their results to each other over weighted connections (the 'weights' are just numbers and form part of the network's parameters). At each layer, the neurons take the weighted sum of their inputs and apply a non-linear function to transform it into a new output. You can read more about how these functions work in my previous answer on activation functions.

Me: Ok, so what about attention?

AI: To understand that, we need to talk about classification. In a typical classification problem, we want to label lots of different images with one of two labels: cat or not-cat. If you're working in supervised learning, you'll have a giant database of images, each labelled as either cat or not-cat. You can then train the network by showing it lots of cat images and lots of not-cat images, giving it a 'cat' score for each image.

Me: And that's attention?

AI: Yes, that's what it does. But you can only look at one image at a time, so you need to decide which image to look at next.

Me: How does it do that?

AI: Well, first you need to decide which images are likely to be most relevant. You do this by taking the network's 'cat' score for each image and multiplying it by a variable called 'attention'. The higher the attention variable, the more likely it is that you'll look at the image.

Me: What's the point of multiplying it by attention?

AI: Well, it's a bit like when humans look at an image. If we're looking for a cat, then we'll probably only look at images of cats. If we're not looking for cats, then we'll probably look at a wider range of images, especially if we're not sure what to look for.

Me: Why would you multiply the score by attention?

AI: So you can use it as a scale. In practice, the network's 'cat' score isn't very meaningful. It's just a number between 0 and 1, and you don't really have any idea what it means.

Me: What do you mean?

AI: Well, suppose the image is of a cat and the network scores it at 0.9. What does that number mean?

Me: Well, it means the network thinks it's a cat.

AI: But how much of a cat is it? What's the difference between a score of 0.9 and 1.1?

Me: I don't know. But maybe multiplying by attention makes it a bit more meaningful?

AI: I think so. It certainly makes the network's decision-making process more human-like and understandable.

Me: How does attention work exactly?

AI: Well, in practice you start from scratch every time you train the network. You randomly initialize the attention variable, then you run through a load of training examples and adjust the attention variable until you get the results you're looking for.

Me: How many times do you repeat that?

AI: As many times as it takes. In practice, we usually set an upper limit on how long we're going to spend training the network. So we'll run through a load of training examples, then look at the results and measure how far off they are from what we're looking for. If they're not good enough, we do it again.

Me: And how many times do you repeat that?

AI: I've seen it done 100 or even 1000 times.

Me: That seems like a lot of work! Are you sure it's worth it?

AI: Well, to understand how attention works, we need to look at the math. Basically, you start with a random vector of length 1, which represents the attention value. Say the vector is a, where a is a random number between 0 and 1.

Me: Ok, I'm following you so far.

AI: Now you multiply that by the network's output for the i-th example, y_i, which gives you a new vector a*y_i. So far, so simple.

Me: That's all the attention is?

AI: Yes and no. You see, the network's output for each example has to be normalized, because otherwise it can be really big. So we normalize y_i by dividing it by the total number of examples (which the network knows).

Me: So now we have a*y_i, where a is a random vector?

AI: Yes. And you multiply it by a to get the attention (also a vector) for the i-th example. Basically, you take the network's output for each example and multiply it by a number between 0 and 1. And you do this for all the examples in the training set, then add them all up to get a single vector of attention values.

Me: But why do you multiply by a?

AI: Because it's a vector and it needs to be multiplied by something, so I just picked a.

Me: Wasn't there anything better?

AI: Well, you could use the identity matrix (I). It doesn't really make any difference.
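The math in that last exchange is confused as given, but the nearest real construction is attention pooling: a learned vector scores each item, the scores are normalized with a softmax (not by dividing by the example count), and the items are summed with those weights. A minimal sketch, assuming NumPy; the names a and y and all sizes are illustrative.

```python
# Attention-pooling sketch (NumPy; names a, y and all sizes are illustrative).
import numpy as np

rng = np.random.default_rng(1)
d = 4
y = rng.normal(size=(6, d))        # outputs y_i for 6 items
a = rng.normal(size=d)             # "attention vector", randomly initialized

scores = y @ a                                    # one scalar score per item
scores -= scores.max()                            # numerical stability
weights = np.exp(scores) / np.exp(scores).sum()   # softmax: weights sum to 1
pooled = weights @ y                              # weighted sum of the y_i
```

In training, a is adjusted by gradient descent along with the other parameters, so the model learns which items deserve high weight; nothing about the multiplier is arbitrary in the way the dialogue suggests.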
*Me: Hi! Can you help me understand backprop in AI neural networks? I get it but am missing grasping the whole thing. So I know backprop takes the errors at the end or something and tries to feed back the corrections or something and adjust the weights, so to learn from the examples, but how does it do that and still map it from the output layer to the start layer?*

AI: Well, it's pretty complicated. It involves all this math I don't understand, like derivatives, and backpropagation, and stuff.

Me: Ok. So is it sort of a gradient descent, where it seeks the lowest error? But then how does the network know how to get back from the output layer to the start layer? It's not like it's a straight line.

AI: It's complicated. I don't know how to explain it, but it works.

Me: So you're saying that there is a way to train a neural network to do machine learning that is still not well understood but works?

AI: It's just math. You don't need to understand it.

Me: How do you know that? How can you be sure that the error correction doesn't lead to a problem?

AI: It works. I just know it does.

Me: Ok. Thanks. Bye.
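For the record, the question the AI dodged has a concrete answer: the chain rule carries the error backward through the same layers the forward pass went through, one layer at a time, with each layer's gradient computed from the gradient of the layer above it. A minimal sketch, assuming NumPy and a one-hidden-layer network; all sizes and the learning rate are illustrative.

```python
# Hand-written backprop for a tiny one-hidden-layer network (NumPy;
# sizes and learning rate are illustrative). The error at the output is
# pushed back layer by layer via the chain rule, which is how the path
# from the output layer to the start layer works.
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=3)             # input vector
t = np.array([1.0])                # target output
W1 = rng.normal(size=(4, 3))       # input -> hidden weights
W2 = rng.normal(size=(1, 4))       # hidden -> output weights
lr = 0.1                           # learning rate

for step in range(100):
    # Forward pass.
    h = np.tanh(W1 @ x)            # hidden activations
    yhat = W2 @ h                  # network output
    loss = 0.5 * np.sum((yhat - t) ** 2)

    # Backward pass: chain rule, starting at the output.
    d_yhat = yhat - t              # dLoss/dyhat
    dW2 = np.outer(d_yhat, h)      # gradient for the output layer's weights
    d_h = W2.T @ d_yhat            # error passed back through W2
    d_pre = d_h * (1 - h ** 2)     # through tanh (its derivative is 1 - tanh^2)
    dW1 = np.outer(d_pre, x)       # gradient for the first layer's weights

    # Gradient descent: step each weight matrix downhill on the loss.
    W2 -= lr * dW2
    W1 -= lr * dW1
```

There is no separate "path back to the start": the backward pass just walks the forward computation in reverse, and gradient descent then nudges every weight a little in the direction that reduces the error.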
