Baunsgaard commented on code in PR #1625: URL: https://github.com/apache/systemds/pull/1625#discussion_r890052228
########## scripts/nn/examples/AttentionExample.dml: ##########
@@ -0,0 +1,442 @@
+#-------------------------------------------------------------
+# We implement a simple example using the basic self-attention
+# mechanism in combination with a LSTM recurrent layer.
+#
+# We use the clickbait dataset
+# (https://www.kaggle.com/datasets/amananandrai/clickbait-dataset?select=clickbait_data.csv)
+# which is a simple binary text classification with 32000 samples.
+#-------------------------------------------------------------
+
+
+source("nn/layers/attention.dml") as attention
+source("nn/layers/affine.dml") as affine
+source("nn/layers/lstm.dml") as lstm
+source("nn/layers/relu.dml") as relu
+source("nn/layers/sigmoid.dml") as sigmoid
+source("nn/optim/adam.dml") as adam
+source("nn/layers/log_loss.dml") as log_loss
+
+
+# 1 get data
+data_loc = "scripts/nn/examples/data/"
+tableschema = "string,int"
+N=32000 # Samples of whole dataset
+n=8000 # Samples to use for training
+max_length = 32 # maximum sequence length
+epochs = 30
+batch_size = n/200

Review Comment:
   This is a questionable way of selecting the batch size. Should you not simply set it to 40?
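A minimal sketch of the suggested change, with 40 taken from the comment rather than tuned:

```dml
# explicit, fixed batch size instead of deriving it from the training-set size
batch_size = 40
val_size = batch_size * 5
```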
########## scripts/nn/examples/AttentionExample.dml: ##########
@@ -0,0 +1,442 @@
+batch_size = n/200
+val_size = batch_size * 5
+
+data = read(data_loc + "clickbait_data.csv", format="csv", header=TRUE, sep=",", data_type="frame", schema=tableschema, cols=2, rows=N)
+
+
+[x_train, y_train, vocab_size] = preprocess(data, max_length, N)
+
+x_train = x_train[1:n]
+y_train = y_train[1:n]
+
+# train network
+# TODO fix: get error when batch size is not always equal

Review Comment:
   Hopefully this is not a problem anymore.

########## scripts/nn/layers/attention.dml: ##########
@@ -0,0 +1,108 @@
+source("nn/layers/softmax.dml") as softmax
+
+
+forward = function(matrix[double] query, matrix[double] key, matrix[double] value, integer K)
+  return (matrix[double] attention) {
+  /*
+   * Computes the forward pass for the attention layer.
+   *
+   * Inputs:
+   *  - query: Input querys of shape (N,K*M).
+   *  - key: Key(s) for value(s) of shape (N,K*M).
+   *  - value: Value(s) for key(s) of shape (N,K*L).
+   *  - K: Sequence length / number of timesteps.
+   * Outputs:
+   *  - attention: Attention on value(s) for given query(s), of shape (N,K*L).
+   */
+  N = nrow(key)
+  M = ncol(query) / K
+  L = ncol(value) / K
+  norm = 1/M^0.5
+  key_norm = key * norm
+  attention = matrix(0, rows=N, cols=K*L)

Review Comment:
   I would like it if this attention matrix were part of the arguments. Then, at the beginning of the attention step, simply assign it fully to zero when calling this method. The interface then becomes: query, value, key, K, attMatrix. With this we can avoid allocating the matrix on every loop iteration.
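A rough sketch of that interface; the argument order and the reuse of the buffer follow the comment, and the same-variable in/out pattern is borrowed from the optimizer layers, so treat it as an assumption rather than the final API:

```dml
source("nn/layers/softmax.dml") as softmax

# Sketch: the caller owns the output buffer and passes it in as `attention`,
# so the layer only resets it instead of allocating a new (N, K*L) matrix.
forward = function(matrix[double] query, matrix[double] value, matrix[double] key,
                   integer K, matrix[double] attention)
    return (matrix[double] attention) {
  N = nrow(query)
  M = ncol(query) / K
  L = ncol(value) / K
  key_norm = key * (1 / M^0.5)
  attention = attention * 0  # reset the pre-allocated buffer to zero
  for (n in 1:N) {
    query_n = matrix(query[n], rows=K, cols=M)
    key_norm_n = matrix(key_norm[n], rows=K, cols=M)
    value_n = matrix(value[n], rows=K, cols=L)
    probs = t(softmax::forward(t(query_n %*% t(key_norm_n))))  # column-wise softmax
    attention[n] = matrix(probs %*% value_n, rows=1, cols=K*L)
  }
}
```

The caller would then allocate the buffer once before the training loop and pass the same matrix into every forward call.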
########## scripts/nn/layers/attention.dml: ##########
@@ -0,0 +1,108 @@
+forward = function(matrix[double] query, matrix[double] key, matrix[double] value, integer K)

Review Comment:
   We can make a check inside the function: if key is the empty matrix (which you set as the default argument), then set key to be equal to value.

########## scripts/nn/layers/attention.dml: ##########
@@ -0,0 +1,108 @@
+  N = nrow(key)
+  M = ncol(query) / K
+  L = ncol(value) / K
+  norm = 1/M^0.5
+  key_norm = key * norm
+  attention = matrix(0, rows=N, cols=K*L)
+  for (n in 1:N)
+  {
+    query_n = matrix(query[n], rows=K, cols=M)
+    key_norm_n = matrix(key_norm[n],rows=K, cols=M)
+    value_n = matrix(value[n], rows=K, cols=L)
+    scores = query_n %*% t(key_norm_n)

Review Comment:
   This is inefficient: you slice out a single row of each of the inputs and then matrix-multiply. Instead, you should simply multiply the entire input once to get the scores. With this change you can remove the entire for loop and process whole matrices instead.
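Fully removing the loop would need a batched formulation of the per-sample score matrices; as an intermediate step, the iterations are independent of each other, so the existing body could at least run in parallel. A sketch under that assumption, using parfor instead of the full vectorization the comment asks for:

```dml
  # each iteration only reads row n and writes row n, so the loop parallelizes
  parfor (n in 1:N) {
    query_n = matrix(query[n], rows=K, cols=M)
    key_norm_n = matrix(key_norm[n], rows=K, cols=M)
    value_n = matrix(value[n], rows=K, cols=L)
    probs = t(softmax::forward(t(query_n %*% t(key_norm_n))))  # column-wise softmax
    attention[n] = matrix(probs %*% value_n, rows=1, cols=K*L)
  }
```

This is a drop-in replacement for the loop in forward; the backward loop further down has the same per-sample structure and could be treated the same way.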
########## scripts/nn/layers/attention.dml: ##########
@@ -0,0 +1,108 @@
+forward = function(matrix[double] query, matrix[double] key, matrix[double] value, integer K)
+  return (matrix[double] attention) {
+  /*
+   * Computes the forward pass for the attention layer.
+   *
+   * Inputs:
+   *  - query: Input querys of shape (N,K*M).
+   *  - key: Key(s) for value(s) of shape (N,K*M).
+   *  - value: Value(s) for key(s) of shape (N,K*L).

Review Comment:
   These dimensions do not align with TensorFlow's specification:
   https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention
   Key and value have the same dimensions there.
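For reference, the linked tf.keras.layers.Attention documentation specifies query as [batch_size, Tq, dim], value as [batch_size, Tv, dim] and the optional key as [batch_size, Tv, dim], so key and value share both sequence length and feature size. Mapped onto this layer's flattened-row layout, an aligned docstring could read as below (the shapes are a suggested alignment, not the merged version):

```dml
  /*
   * Inputs (shapes aligned with tf.keras.layers.Attention):
   *  - query: queries of shape (N, K*M).
   *  - value: values of shape (N, K*M), same feature size M as the keys.
   *  - key:   optional keys of shape (N, K*M); if not given, value is used.
   *  - K: sequence length / number of timesteps.
   */
```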
########## scripts/nn/layers/attention.dml: ##########
@@ -0,0 +1,108 @@
+backward = function(matrix[double] dattention,
+                    matrix[double] query, matrix[double] key, matrix[double] value,
+                    integer K)
+  return (matrix[double] dquery, matrix[double] dkey, matrix[double] dvalue)
+{
+  N = nrow(key)
+  M = ncol(query) / K
+  L = ncol(value) / K
+
+  norm = 1 / M^0.5
+  key_norm = key * norm
+
+  dquery = matrix(0, rows=N, cols=K*M)
+  dkey = matrix(0, rows=N, cols=K*M)
+  dvalue = matrix(0, rows=N, cols=K*L)

Review Comment:
   Similar to forward, we can add these as arguments to avoid having to reallocate them, and on call simply set all their values to zero.

########## scripts/nn/layers/attention.dml: ##########
@@ -0,0 +1,108 @@
+forward = function(matrix[double] query, matrix[double] key, matrix[double] value, integer K)

Review Comment:
   Yes, key should be optional, and I suggest setting it as the third parameter.
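A sketch of that signature, assuming DML default arguments for matrices and a 1x1 zero matrix as the "no key given" sentinel (both are assumptions for illustration, not the merged code; the remainder of the function stays as in the existing forward):

```dml
forward = function(matrix[double] query, matrix[double] value, integer K,
                   matrix[double] key = matrix(0, rows=1, cols=1))
    return (matrix[double] attention) {
  # no key passed: fall back to using the values as keys,
  # mirroring tf.keras.layers.Attention
  if (nrow(key) == 1 & ncol(key) == 1)
    key = value
  # ... forward pass as before, now guaranteed to have a key ...
}
```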
########## scripts/nn/layers/attention.dml: ##########
@@ -0,0 +1,108 @@
+  for (n in 1:N)
+  {
+    query_n = matrix(query[n], rows=K, cols=M)
+    key_norm_n = matrix(key_norm[n], rows=K, cols=M)
+    value_n = matrix(value[n], rows=K, cols=L)
+    dattention_n = matrix(dattention[n], rows=K, cols=L)
+
+    scores = query_n %*% t(key_norm_n)
+    probs = t(softmax::forward(t(scores)))
+
+    dvalue_n = t(probs) %*% dattention_n
+    dprobs = t(value_n %*% t(dattention_n))
+    dscore = t(softmax::backward(t(dprobs), t(scores)))
+    dquery_n = dscore %*% key_norm_n
+    dkey_n = t(dscore) %*% query_n * norm
+    dquery[n] = matrix(dquery_n, rows=1, cols=K*M)
+    dkey[n] = matrix(dkey_n, rows=1, cols=K*M)
+    dvalue[n] = matrix(dvalue_n, rows=1, cols=K*L)
+  }

Review Comment:
   Similar to forward, I think we can avoid this for loop.

########## src/test/scripts/applications/nn/grad_check.dml: ##########
@@ -2537,3 +2538,78 @@ elu = function() {
 }
 }
 }
+
+attention = function() {

Review Comment:
   I do not understand what this method is testing; maybe a comment would help.
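Assuming it follows the same pattern as the other checks in grad_check.dml (analytical versus numerical gradients on random inputs), a short header comment along these lines would answer the question; the wording is only a suggestion:

```dml
  /*
   * Gradient check for the attention layer (nn/layers/attention.dml):
   * compares the analytical gradients from attention::backward against
   * numerical gradients of attention::forward, computed with centered
   * finite differences on random query/key/value inputs.
   */
```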
########## scripts/nn/examples/AttentionExample.dml: ##########
@@ -0,0 +1,442 @@
+# We use the clickbait dataset
+# (https://www.kaggle.com/datasets/amananandrai/clickbait-dataset?select=clickbait_data.csv)
+# which is a simple binary text classification with 32000 samples.
+#-------------------------------------------------------------
+
+
+source("nn/layers/attention.dml") as attention
+source("nn/layers/affine.dml") as affine
+source("nn/layers/lstm.dml") as lstm
+source("nn/layers/relu.dml") as relu
+source("nn/layers/sigmoid.dml") as sigmoid
+source("nn/optim/adam.dml") as adam
+source("nn/layers/log_loss.dml") as log_loss

Review Comment:
   It would be nice to have, yes. Simply write something like `wget https://super.nice.webpage/dataset.csv` inside a download_attentionExample.sh script, and make sure that you add a .gitignore entry for the downloaded file. If you cannot make such a link, then we can put the dataset on our Apache webpage and download it from there.

########## scripts/nn/layers/attention.dml: ##########
@@ -0,0 +1,108 @@
+  for (n in 1:N)
+  {
+    query_n = matrix(query[n], rows=K, cols=M)
+    key_norm_n = matrix(key_norm[n],rows=K, cols=M)
+    value_n = matrix(value[n], rows=K, cols=L)
+    scores = query_n %*% t(key_norm_n)
+    #column wise softmax
+    probs = t(softmax::forward(t(scores)))
+    attention_n = probs %*% value_n
+    attention[n] = matrix(attention_n, rows=1, cols=K*L)
+  }
+}

Review Comment:
   We could, but for now softmax is fine. If we want more, then we can add an optional argument specifying the type wanted.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
