[GitHub] [systemds] mboehm7 commented on a diff in pull request #1879: [SYSTEMDS-3153] Missing value imputation using KNN

via GitHub Thu, 10 Aug 2023 09:27:33 -0700


mboehm7 commented on code in PR #1879:
URL: https://github.com/apache/systemds/pull/1879#discussion_r1290372166



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -0,0 +1,208 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Impute the data by using KNN-algorithm and finding the nearest neighbor using euclidean distance
+# Currently only for single column with multiple missing values and top 1 neighbor
+#
+# INPUT:
+# -------------------------------------------------------------------------------------
+# data    Data Matrix (numerical)
+# method  methods of calculating the KNN-algorith depending on the size of data and missing values
+# -------------------------------------------------------------------------------------
+#
+# OUTPUT:
+# -----------------------------------------------------------------------------------

Review Comment:
   Please align the columns and the surrounding separator lines consistently.



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -0,0 +1,208 @@
+# Impute the data by using KNN-algorithm and finding the nearest neighbor using euclidean distance
+# Currently only for single column with multiple missing values and top 1 neighbor
+#
+# INPUT:
+# -------------------------------------------------------------------------------------
+# data    Data Matrix (numerical)
+# method  methods of calculating the KNN-algorith depending on the size of data and missing values

Review Comment:
   Include the valid method names in the documentation, and pick meaningful names instead of method2 and method3.



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -0,0 +1,208 @@
+# Impute the data by using KNN-algorithm and finding the nearest neighbor using euclidean distance
+# Currently only for single column with multiple missing values and top 1 neighbor
+#
+# INPUT:
+# -------------------------------------------------------------------------------------
+# data    Data Matrix (numerical)
+# method  methods of calculating the KNN-algorith depending on the size of data and missing values
+# -------------------------------------------------------------------------------------
+#
+# OUTPUT:
+# -----------------------------------------------------------------------------------
+# result     imputed dataset
+# -----------------------------------------------------------------------------------
+
+
+#method 1
+#impute By Mean, really small, coarse grained-operation
+#Extract Top 1 distances, impute the respective value
+#dist runtime * #features

Review Comment:
   This estimate (dist runtime * #features) seems incorrect, as the distance computation is already O(N^2 * #features).
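
   To make the cost concrete, here is a minimal Java sketch (an illustration only, not the SystemDS `dist` implementation; the class and method names are hypothetical). It computes all pairwise squared Euclidean distances with three nested loops and counts the inner-loop steps, which come out to N * N * #features.

```java
// Hypothetical illustration only; not SystemDS code.
final class PairwiseDistanceCost {
    /** Number of inner-loop steps performed by the last call to pairwise(). */
    static long steps;

    /** Returns the n x n matrix of squared Euclidean distances between the rows of x. */
    static double[][] pairwise(double[][] x) {
        int n = x.length;
        int d = (n == 0) ? 0 : x[0].length;
        double[][] dist = new double[n][n];
        steps = 0;
        for (int i = 0; i < n; i++) {          // N rows
            for (int j = 0; j < n; j++) {      // times N rows
                double sum = 0;
                for (int k = 0; k < d; k++) {  // times #features
                    double diff = x[i][k] - x[j][k];
                    sum += diff * diff;
                    steps++;
                }
                dist[i][j] = sum;
            }
        }
        return dist;
    }
}
```

   The feature count is already a factor inside the distance computation, so multiplying the dist runtime by #features again counts it twice.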



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -0,0 +1,208 @@
+# Impute the data by using KNN-algorithm and finding the nearest neighbor using euclidean distance
+# Currently only for single column with multiple missing values and top 1 neighbor
+#
+# INPUT:
+# -------------------------------------------------------------------------------------
+# data    Data Matrix (numerical)
+# method  methods of calculating the KNN-algorith depending on the size of data and missing values
+# -------------------------------------------------------------------------------------
+#
+# OUTPUT:
+# -----------------------------------------------------------------------------------
+# result     imputed dataset
+# -----------------------------------------------------------------------------------
+
+
+#method 1
+#impute By Mean, really small, coarse grained-operation
+#Extract Top 1 distances, impute the respective value
+#dist runtime * #features
+#method 2
+#assuming missing value is very small, 1%
+#compute only the distances between tuples between missing value, impute the mean,
+#compute distance between entire "X" (potentially large) and rows with missing values (hopefully small),
+#method 3
+#sample rows, randomly select subset of rows
+#assuming missing value is very large,
+#compute distance between sample of rows of X (control small) and rows with missing values (if not small),
+
+m_imputeByKNN = function(Matrix[Double] data, String method = "default")
+return(Matrix[Double] result)
+{
+  print("KNN-IMPUTATION SCRIPT")
+
+  #Test data, initial data, will be replaced later with variable data
+  first = matrix ("1 3 4 8", rows = 4, cols = 1)
+  second = matrix ("2 4 6 7", rows = 4, cols = 1)
+  third = matrix ("4 NaN 5 NaN", rows = 4, cols = 1)
+  test_matrix = cbind(first,second,third)
+
+  #Add more data for more test cases
+  fifth = matrix("5 6 7", rows = 1, cols = 3)
+  sixth = matrix("3 4 5", rows = 1 , cols = 3)
+  tm2 = rbind(test_matrix, fifth,sixth)
+
+  #Create a mask for placeholder and to check for missing values
+  masked = is.nan(tm2)
+  print(toString(tm2))
+
+  #Find the column containing multiple missing values
+  missing_col = rowIndexMax(colSums(is.nan(tm2)))
+
+  #change method here temporary "default"/"method2" for testing purposes, will be replaced by user input
+  method = "method3"
+
+  #Impute NaN value with temporary mean value of the column
+  filled_matrix = imputeByMean(tm2, matrix(0, cols = ncol(tm2), rows = 1))
+
+  if(method == "default"){
+    #METHOD 1
+    #Calculate the distance using dist method after imputation with mean
+    distance_matrix = dist(filled_matrix)
+
+    #Change 0 value so rowIndexMin will ignore that diagonal value
+    distance_matrix = replace(target = distance_matrix, pattern = 0, replacement = 999)
+
+    #Get the minimum distance row-wise computation
+    minimum_index = rowIndexMin(distance_matrix)
+
+    #Position of missing values in per row in which column
+    position = rowSums(is.nan(tm2))
+    position = position * minimum_index
+
+    #Filter the 0 out
+    I = (rowSums(is.nan(tm2))!=0)
+    missing = removeEmpty(target=position, margin="rows", select=I)
+
+    #Convert the value indices into 0/1 matrix to find location
+    indices = table(missing, seq(1,nrow(filled_matrix)),odim1=nrow(filled_matrix),odim2=nrow(missing))
+
+    #Replace the index with value
+    imputedValue = t(indices) %*% filled_matrix[,as.scalar(missing_col)]
+
+    #Get the index location of the missing value
+    x = rowSums(is.nan(tm2))
+    missing_indices = seq(1, nrow(x)) * x
+
+    #Put the replacement value in the missing indices
+    I2 = removeEmpty(target=missing_indices, margin="rows")
+    R = table(I2,1,imputedValue,odim1 = nrow(tm2), odim2=1)
+
+    #Replace the masked column with to be imputed Value
+    masked[,as.scalar(missing_col)] = masked[,as.scalar(missing_col)] * R
+
+    #Impute the value
+    result = replace(target = tm2, pattern = NaN, replacement = 0)
+    result = result + masked
+    print("Result method1")
+    print(toString(result))

Review Comment:
   Move the common code out of the method-specific branches (to the end of the builtin function).
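
   As a rough sketch of the intended structure, written in Java rather than DML and with hypothetical method names ("all", "firstHalf") and helpers: the branches only decide which complete rows serve as candidate neighbours, while the nearest-neighbour lookup and the write-back of the imputed value appear once. It assumes, like the PR, that NaN occurs in a single column.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: method names and helpers are illustrative assumptions,
// not the PR's code. Assumes NaN values occur only in column col.
final class CommonTailSketch {
    static double[][] impute(double[][] data, int col, String method) {
        // Rows without a missing value in the target column.
        List<double[]> complete = new ArrayList<>();
        for (double[] row : data)
            if (!Double.isNaN(row[col])) complete.add(row);

        // Method-specific part: only the choice of candidate neighbours differs.
        List<double[]> candidates;
        switch (method) {
            case "all":
                candidates = complete;
                break;
            case "firstHalf": // stands in for a sampling strategy
                candidates = complete.subList(0, (complete.size() + 1) / 2);
                break;
            default:
                throw new IllegalArgumentException("Unknown method: " + method);
        }
        if (candidates.isEmpty())
            throw new IllegalStateException("No complete rows to use as neighbours");

        // Common part, written once: nearest-neighbour lookup and write-back.
        double[][] result = new double[data.length][];
        for (int i = 0; i < data.length; i++) {
            result[i] = data[i].clone();
            if (Double.isNaN(data[i][col]))
                result[i][col] = nearest(data[i], candidates, col)[col];
        }
        return result;
    }

    /** Returns the candidate closest to row, ignoring the target column. */
    static double[] nearest(double[] row, List<double[]> candidates, int col) {
        double[] best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (double[] c : candidates) {
            double sum = 0;
            for (int k = 0; k < row.length; k++) {
                if (k == col) continue;
                double diff = row[k] - c[k];
                sum += diff * diff;
            }
            if (sum < bestDist) { bestDist = sum; best = c; }
        }
        return best;
    }
}
```

   With this shape, adding another method means adding one branch that selects candidates; the imputation step and the final result assignment stay untouched.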



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -0,0 +1,208 @@
+# Impute the data by using KNN-algorithm and finding the nearest neighbor using euclidean distance
+# Currently only for single column with multiple missing values and top 1 neighbor
+#
+# INPUT:
+# -------------------------------------------------------------------------------------
+# data    Data Matrix (numerical)
+# method  methods of calculating the KNN-algorith depending on the size of data and missing values
+# -------------------------------------------------------------------------------------
+#
+# OUTPUT:
+# -----------------------------------------------------------------------------------
+# result     imputed dataset
+# -----------------------------------------------------------------------------------
+
+
+#method 1

Review Comment:
   Move these comment blocks into the function's documentation header so they are automatically picked up when we generate the Python API and documentation.



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -0,0 +1,208 @@
+# Impute the data by using KNN-algorithm and finding the nearest neighbor using euclidean distance
+# Currently only for single column with multiple missing values and top 1 neighbor
+#
+# INPUT:
+# -------------------------------------------------------------------------------------
+# data    Data Matrix (numerical)
+# method  methods of calculating the KNN-algorith depending on the size of data and missing values
+# -------------------------------------------------------------------------------------
+#
+# OUTPUT:
+# -----------------------------------------------------------------------------------
+# result     imputed dataset
+# -----------------------------------------------------------------------------------
+
+
+#method 1
+#impute By Mean, really small, coarse grained-operation
+#Extract Top 1 distances, impute the respective value
+#dist runtime * #features
+#method 2
+#assuming missing value is very small, 1%
+#compute only the distances between tuples between missing value, impute the mean,
+#compute distance between entire "X" (potentially large) and rows with missing values (hopefully small),
+#method 3
+#sample rows, randomly select subset of rows
+#assuming missing value is very large,
+#compute distance between sample of rows of X (control small) and rows with missing values (if not small),
+
+m_imputeByKNN = function(Matrix[Double] data, String method = "default")
+return(Matrix[Double] result)
+{
+  print("KNN-IMPUTATION SCRIPT")
+
+  #Test data, initial data, will be replaced later with variable data
+  first = matrix ("1 3 4 8", rows = 4, cols = 1)
+  second = matrix ("2 4 6 7", rows = 4, cols = 1)
+  third = matrix ("4 NaN 5 NaN", rows = 4, cols = 1)
+  test_matrix = cbind(first,second,third)

Review Comment:
   This test data cannot be in the builtin function, which is supposed to be general-purpose. Move it to the tests.



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -0,0 +1,208 @@
+# Impute the data by using KNN-algorithm and finding the nearest neighbor using euclidean distance
+# Currently only for single column with multiple missing values and top 1 neighbor
+#
+# INPUT:
+# -------------------------------------------------------------------------------------
+# data    Data Matrix (numerical)
+# method  methods of calculating the KNN-algorith depending on the size of data and missing values
+# -------------------------------------------------------------------------------------
+#
+# OUTPUT:
+# -----------------------------------------------------------------------------------
+# result     imputed dataset
+# -----------------------------------------------------------------------------------
+
+
+#method 1
+#impute By Mean, really small, coarse grained-operation
+#Extract Top 1 distances, impute the respective value
+#dist runtime * #features
+#method 2
+#assuming missing value is very small, 1%
+#compute only the distances between tuples between missing value, impute the mean,
+#compute distance between entire "X" (potentially large) and rows with missing values (hopefully small),
+#method 3
+#sample rows, randomly select subset of rows
+#assuming missing value is very large,
+#compute distance between sample of rows of X (control small) and rows with missing values (if not small),
+
+m_imputeByKNN = function(Matrix[Double] data, String method = "default")
+return(Matrix[Double] result)
+{
+  print("KNN-IMPUTATION SCRIPT")
+
+  #Test data, initial data, will be replaced later with variable data
+  first = matrix ("1 3 4 8", rows = 4, cols = 1)
+  second = matrix ("2 4 6 7", rows = 4, cols = 1)
+  third = matrix ("4 NaN 5 NaN", rows = 4, cols = 1)
+  test_matrix = cbind(first,second,third)
+
+  #Add more data for more test cases
+  fifth = matrix("5 6 7", rows = 1, cols = 3)
+  sixth = matrix("3 4 5", rows = 1 , cols = 3)
+  tm2 = rbind(test_matrix, fifth,sixth)
+
+  #Create a mask for placeholder and to check for missing values
+  masked = is.nan(tm2)
+  print(toString(tm2))

Review Comment:
   Remove such print statements. Add a 'verbose' flag and print appropriate debug information only if verbose output is requested.
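
   The pattern is the usual one; a tiny Java sketch of it follows, where the class, method, and message text are hypothetical. The function always returns its result and writes diagnostics only when the caller passes verbose = true.

```java
// Hypothetical sketch of a verbose flag; names and message are assumptions.
final class VerboseSketch {
    /** Counts NaN entries; prints a summary only when verbose output is requested. */
    static int countMissing(double[] column, boolean verbose) {
        int missing = 0;
        for (double v : column)
            if (Double.isNaN(v)) missing++;
        if (verbose)
            System.out.println("imputeByKNN: " + missing + " of "
                + column.length + " values missing");
        return missing;
    }
}
```

   Defaulting the flag to false keeps the builtin silent in pipelines, and the debug output is still one argument away.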



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -0,0 +1,208 @@
+# Impute the data by using KNN-algorithm and finding the nearest neighbor using euclidean distance
+# Currently only for single column with multiple missing values and top 1 neighbor
+#
+# INPUT:
+# -------------------------------------------------------------------------------------
+# data    Data Matrix (numerical)
+# method  methods of calculating the KNN-algorith depending on the size of data and missing values
+# -------------------------------------------------------------------------------------
+#
+# OUTPUT:
+# -----------------------------------------------------------------------------------
+# result     imputed dataset
+# -----------------------------------------------------------------------------------
+
+
+#method 1
+#impute By Mean, really small, coarse grained-operation
+#Extract Top 1 distances, impute the respective value
+#dist runtime * #features
+#method 2
+#assuming missing value is very small, 1%
+#compute only the distances between tuples between missing value, impute the mean,
+#compute distance between entire "X" (potentially large) and rows with missing values (hopefully small),
+#method 3
+#sample rows, randomly select subset of rows
+#assuming missing value is very large,
+#compute distance between sample of rows of X (control small) and rows with missing values (if not small),
+
+m_imputeByKNN = function(Matrix[Double] data, String method = "default")
+return(Matrix[Double] result)
+{
+  print("KNN-IMPUTATION SCRIPT")
+
+  #Test data, initial data, will be replaced later with variable data
+  first = matrix ("1 3 4 8", rows = 4, cols = 1)
+  second = matrix ("2 4 6 7", rows = 4, cols = 1)
+  third = matrix ("4 NaN 5 NaN", rows = 4, cols = 1)
+  test_matrix = cbind(first,second,third)
+
+  #Add more data for more test cases
+  fifth = matrix("5 6 7", rows = 1, cols = 3)
+  sixth = matrix("3 4 5", rows = 1 , cols = 3)
+  tm2 = rbind(test_matrix, fifth,sixth)
+
+  #Create a mask for placeholder and to check for missing values
+  masked = is.nan(tm2)
+  print(toString(tm2))
+
+  #Find the column containing multiple missing values
+  missing_col = rowIndexMax(colSums(is.nan(tm2)))
+
+  #change method here temporary "default"/"method2" for testing purposes, will be replaced by user input
+  method = "method3"
+
+  #Impute NaN value with temporary mean value of the column
+  filled_matrix = imputeByMean(tm2, matrix(0, cols = ncol(tm2), rows = 1))
+
+  if(method == "default"){
+    #METHOD 1
+    #Calculate the distance using dist method after imputation with mean
+    distance_matrix = dist(filled_matrix)
+
+    #Change 0 value so rowIndexMin will ignore that diagonal value
+    distance_matrix = replace(target = distance_matrix, pattern = 0, replacement = 999)
+
+    #Get the minimum distance row-wise computation
+    minimum_index = rowIndexMin(distance_matrix)
+
+    #Position of missing values in per row in which column
+    position = rowSums(is.nan(tm2))
+    position = position * minimum_index
+
+    #Filter the 0 out
+    I = (rowSums(is.nan(tm2))!=0)
+    missing = removeEmpty(target=position, margin="rows", select=I)
+
+    #Convert the value indices into 0/1 matrix to find location
+    indices = table(missing, seq(1,nrow(filled_matrix)),odim1=nrow(filled_matrix),odim2=nrow(missing))
+
+    #Replace the index with value
+    imputedValue = t(indices) %*% filled_matrix[,as.scalar(missing_col)]
+
+    #Get the index location of the missing value
+    x = rowSums(is.nan(tm2))
+    missing_indices = seq(1, nrow(x)) * x
+
+    #Put the replacement value in the missing indices
+    I2 = removeEmpty(target=missing_indices, margin="rows")
+    R = table(I2,1,imputedValue,odim1 = nrow(tm2), odim2=1)
+
+    #Replace the masked column with to be imputed Value
+    masked[,as.scalar(missing_col)] = masked[,as.scalar(missing_col)] * R
+
+    #Impute the value
+    result = replace(target = tm2, pattern = NaN, replacement = 0)
+    result = result + masked
+    print("Result method1")
+    print(toString(result))
+
+  } else if(method == "method2"){
+    #METHOD 2
+    #Split the matrix into containing NaN values (missing records) and not containing NaN values (M2 records)
+    I = (rowSums(is.nan(tm2))!=0)
+    missing = removeEmpty(target=filled_matrix, margin="rows", select=I)
+
+    Y = (rowSums(is.nan(tm2))==0)
+    M2 = removeEmpty(target=filled_matrix, margin = "rows", select = Y)
+
+    #Calculate the euclidean distance between fully records and missing records, and then find the min value row wise
+    D = -2 * (M2 %*% t(missing)) + t(rowSums (missing ^ 2));
+    minD = rowIndexMin(t(D))
+
+    #Convert the value indices into 0/1 matrix to find location
+    indices = table(minD, seq(1,nrow(M2)),odim1=nrow(M2),odim2=nrow(minD))
+
+    #Replace the value
+    imputedValue = t(indices) %*% M2[,as.scalar(missing_col)]
+
+    #Get the index location of the missing value
+    x = rowSums(is.nan(tm2))
+    missing_indices = seq(1, nrow(x)) * x
+
+    #Put the replacement value in the missing indices
+    I2 = removeEmpty(target=missing_indices, margin="rows")
+    R = table(I2,1,imputedValue,odim1 = nrow(tm2), odim2=1)
+
+    #Update the masked value
+    masked[,as.scalar(missing_col)] = masked[,as.scalar(missing_col)] * R
+
+    result = replace(target = tm2, pattern = NaN, replacement = 0)
+    result = result + masked
+
+    print("Result method2")
+    print(toString(result))
+
+  } else if(method == "method3"){
+    #METHOD 3
+    #Split the matrix into containing NaN values (missing records) and not containing NaN values (M2 records)
+    I = (rowSums(is.nan(tm2))!=0)
+    missing = removeEmpty(target=filled_matrix, margin="rows", select=I)
+
+    Y = (rowSums(is.nan(tm2))==0)
+    M3 = removeEmpty(target=filled_matrix, margin = "rows", select = Y)
+
+    #Create a random subset
+    random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 1, sparsity = 0.5, seed = 33))
+    #ensure that random_matrix has at least 1 value
+    if(as.scalar(colSums(random_matrix)) < 1) { random_matrix = matrix(1, rows = nrow(M3), cols = 1)}
+
+    subset = M3 * random_matrix
+    subset = removeEmpty(target=subset, margin = "rows", select = random_matrix)
+
+    #Calculate the euclidean distance between fully records and missing records, and then find the min value row wise
+    D = -2 * (subset %*% t(missing)) + t(rowSums (missing ^ 2));
+    minD = rowIndexMin(t(D))
+
+    #Convert the value indices into 0/1 matrix to find location
+    indices = table(minD, seq(1,nrow(subset)),odim1=nrow(subset),odim2=nrow(minD))
+
+    #Replace the value
+    imputedValue = t(indices) %*% subset[,as.scalar(missing_col)]
+
+    #Get the index location of the missing value
+    x = rowSums(is.nan(tm2))
+    missing_indices = seq(1, nrow(x)) * x
+
+    #Put the replacement value in the missing indices
+    I2 = removeEmpty(target=missing_indices, margin="rows")
+    R = table(I2,1,imputedValue,odim1 = nrow(tm2), odim2=1)
+
+    #Update the masked value
+    masked[,as.scalar(missing_col)] = masked[,as.scalar(missing_col)] * R
+
+    result = replace(target = tm2, pattern = NaN, replacement = 0)
+    result = result + masked
+
+    print("Result method3")
+    print(toString(result))
+  } else {
+    print("Method is unknown or not yet implemented")
+  }
+
+#Default Results to be replaced with variable result
+result = data
+}
+

Review Comment:
   Avoid multiple empty lines after the end of the builtin function.



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -0,0 +1,208 @@
+# Impute the data by using KNN-algorithm and finding the nearest neighbor using euclidean distance
+# Currently only for single column with multiple missing values and top 1 neighbor
+#
+# INPUT:
+# -------------------------------------------------------------------------------------
+# data    Data Matrix (numerical)
+# method  methods of calculating the KNN-algorith depending on the size of data and missing values
+# -------------------------------------------------------------------------------------
+#
+# OUTPUT:
+# -----------------------------------------------------------------------------------
+# result     imputed dataset
+# -----------------------------------------------------------------------------------
+
+
+#method 1
+#impute By Mean, really small, coarse grained-operation
+#Extract Top 1 distances, impute the respective value
+#dist runtime * #features
+#method 2
+#assuming missing value is very small, 1%
+#compute only the distances between tuples between missing value, impute the mean,
+#compute distance between entire "X" (potentially large) and rows with missing values (hopefully small),
+#method 3
+#sample rows, randomly select subset of rows
+#assuming missing value is very large,
+#compute distance between sample of rows of X (control small) and rows with missing values (if not small),
+
+m_imputeByKNN = function(Matrix[Double] data, String method = "default")

Review Comment:
   Add a seed parameter, and derive the seeds for all rand/sample calls from it to ensure deterministic behavior when needed.
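
   One possible way to do the derivation, sketched in Java with hypothetical names. It assumes the convention that seed = -1 means "no fixed seed", and it derives the seed of the k-th random call by adding an offset to the user's seed. Treat both as assumptions for illustration, not as the required scheme.

```java
import java.util.Random;

// Hypothetical sketch; the -1 convention and the offset scheme are assumptions.
final class SeedDerivationSketch {
    /** Seed for the k-th random call; -1 requests a fresh, non-deterministic seed. */
    static long derive(long seed, int k) {
        return (seed == -1) ? new Random().nextLong() : seed + k;
    }

    /** Draws a row-selection mask of length n where each row is kept with probability p. */
    static boolean[] sampleMask(int n, double p, long seed) {
        Random rnd = new Random(seed);
        boolean[] mask = new boolean[n];
        for (int i = 0; i < n; i++)
            mask[i] = rnd.nextDouble() < p;
        return mask;
    }
}
```

   Calling `sampleMask(n, 0.5, derive(seed, 0))` twice with the same user seed then selects the same rows, which replaces the hard-coded `seed = 33` with something the caller controls.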



##########
src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinImputeKNNTest.java:
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.builtin.part1;
+
+import org.apache.sysds.api.DMLScript;
+import org.apache.sysds.common.Types;
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.apache.sysds.test.TestUtils;
+import org.junit.Test;
+
+import java.io.IOException;
+
+public class BuiltinImputeKNNTest extends AutomatedTestBase {
+
+    private final static String TEST_NAME = "imputeByKNN";
+    private final static String TEST_DIR = "functions/builtin/";
+    private static final String TEST_CLASS_DIR = TEST_DIR + BuiltinImputeKNNTest.class.getSimpleName() + "/";
+    @Override
+    public void setUp() {
+        TestUtils.clearAssertionInformation();
+        addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[] {"C"}));
+    }
+
+    @Test
+    public void test()throws IOException{
+        runImputeKNN(Types.ExecType.CP);

Review Comment:
   Also replicate the CP/Spark test for each method.



##########
src/main/java/org/apache/sysds/common/Builtins.java:
##########
@@ -166,6 +166,7 @@ public enum Builtins {
        IMG_INVERT("img_invert", true),
        IMG_POSTERIZE("img_posterize", true),
        IMPURITY_MEASURES("impurityMeasures", true),
+       IMPUTE_BY_KNN("imputeByKNN",true),

Review Comment:
   missing space between arguments.



##########
src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinImputeKNNTest.java:
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.builtin.part1;
+
+import org.apache.sysds.api.DMLScript;
+import org.apache.sysds.common.Types;
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.apache.sysds.test.TestUtils;
+import org.junit.Test;
+
+import java.io.IOException;
+
+public class BuiltinImputeKNNTest extends AutomatedTestBase {
+
+    private final static String TEST_NAME = "imputeByKNN";
+    private final static String TEST_DIR = "functions/builtin/";
+    private static final String TEST_CLASS_DIR = TEST_DIR + BuiltinImputeKNNTest.class.getSimpleName() + "/";
+    @Override
+    public void setUp() {
+        TestUtils.clearAssertionInformation();
+        addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[] {"C"}));
+    }
+
+    @Test
+    public void test()throws IOException{
+        runImputeKNN(Types.ExecType.CP);

Review Comment:
   Add a second test with exec type Spark.



##########
src/test/java/org/apache/sysds/test/functions/builtin/part1/BuiltinImputeKNNTest.java:
##########
@@ -0,0 +1,59 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+package org.apache.sysds.test.functions.builtin.part1;
+
+import org.apache.sysds.api.DMLScript;
+import org.apache.sysds.common.Types;
+import org.apache.sysds.test.AutomatedTestBase;
+import org.apache.sysds.test.TestConfiguration;
+import org.apache.sysds.test.TestUtils;
+import org.junit.Test;
+
+import java.io.IOException;
+
+public class BuiltinImputeKNNTest extends AutomatedTestBase {
+
+    private final static String TEST_NAME = "imputeByKNN";
+    private final static String TEST_DIR = "functions/builtin/";
+    private static final String TEST_CLASS_DIR = TEST_DIR + BuiltinImputeKNNTest.class.getSimpleName() + "/";
+    @Override
+    public void setUp() {
+        TestUtils.clearAssertionInformation();
+        addTestConfiguration(TEST_NAME, new TestConfiguration(TEST_CLASS_DIR, TEST_NAME, new String[] {"C"}));
+    }
+
+    @Test
+    public void test()throws IOException{
+        runImputeKNN(Types.ExecType.CP);
+    }
+
+    private void runImputeKNN(Types.ExecType instType) throws IOException {
+        Types.ExecMode platform_old = setExecMode(instType);
+        try {
+            loadTestConfiguration(getTestConfiguration(TEST_NAME));
+            String HOME = SCRIPT_DIR + TEST_DIR;
+            fullDMLScriptName = HOME + TEST_NAME + ".dml";
+            programArgs = new String[] {}; //
+            runTest(true, false, null, -1);

Review Comment:
   Try to generate meaningful inputs and compare the results. You could, for
   example, check that the sum of the output matrix is roughly the same for all
   three methods (computed with the brute-force dist method).
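
   A minimal sketch of such a check, assuming the outputs of the methods are already available as dense `double[][]` arrays. The class name, the helper names, the relative tolerance, and the sample values in `main` are all illustrative assumptions, not part of the PR or of the SystemDS test utilities.

```java
class SumCheckSketch {
    // Sum of all cells of a dense matrix.
    static double sum(double[][] m) {
        double s = 0;
        for (double[] row : m) {
            for (double v : row) {
                s += v;
            }
        }
        return s;
    }

    // True if a and b differ by at most relTol, measured relative to the
    // larger of the two magnitudes (with a tiny floor to handle zeros).
    static boolean roughlyEqual(double a, double b, double relTol) {
        double scale = Math.max(Math.abs(a), Math.abs(b));
        return Math.abs(a - b) <= relTol * Math.max(scale, 1e-12);
    }

    public static void main(String[] args) {
        // Made-up matrices standing in for the results of two methods;
        // in a real test these would be read from the script outputs.
        double[][] reference = {{1, 2, 4}, {3, 4, 4}};
        double[][] other = {{1, 2, 4}, {3, 4, 5}};
        System.out.println(roughlyEqual(sum(reference), sum(other), 0.1));
    }
}
```

   A relative tolerance is used here because the sampled method may pick a different neighbor than the brute-force one, so exact equality of the sums is not expected; the right tolerance depends on the generated data.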



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -0,0 +1,208 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+# Impute the data by using KNN-algorithm and finding the nearest neighbor using euclidean distance
+# Currently only for single column with multiple missing values and top 1 neighbor
+#
+# INPUT:
+# -------------------------------------------------------------------------------------
+# data    Data Matrix (numerical)
+# method  methods of calculating the KNN-algorith depending on the size of data and missing values
+# -------------------------------------------------------------------------------------
+#
+# OUTPUT:
+# -----------------------------------------------------------------------------------
+# result     imputed dataset
+# -----------------------------------------------------------------------------------
+
+
+#method 1
+#impute By Mean, really small, coarse grained-operation
+#Extract Top 1 distances, impute the respective value
+#dist runtime * #features
+#method 2
+#assuming missing value is very small, 1%
+#compute only the distances between tuples between missing value, impute the mean,
+#compute distance between entire "X" (potentially large) and rows with missing values (hopefully small),
+#method 3
+#sample rows, randomly select subset of rows
+#assuming missing value is very large,
+#compute distance between sample of rows of X (control small) and rows with missing values (if not small),
+
+m_imputeByKNN = function(Matrix[Double] data, String method = "default")
+return(Matrix[Double] result)
+{
+  print("KNN-IMPUTATION SCRIPT")
+
+  #Test data, initial data, will be replaced later with variable data
+  first = matrix ("1 3 4 8", rows = 4, cols = 1)
+  second = matrix ("2 4 6 7", rows = 4, cols = 1)
+  third = matrix ("4 NaN 5 NaN", rows = 4, cols = 1)
+  test_matrix = cbind(first,second,third)
+
+  #Add more data for more test cases
+  fifth = matrix("5 6 7", rows = 1, cols = 3)
+  sixth = matrix("3 4 5", rows = 1 , cols = 3)
+  tm2 = rbind(test_matrix, fifth,sixth)
+
+  #Create a mask for placeholder and to check for missing values
+  masked = is.nan(tm2)
+  print(toString(tm2))
+
+  #Find the column containing multiple missing values
+  missing_col = rowIndexMax(colSums(is.nan(tm2)))
+
+  #change method here temporary "default"/"method2" for testing purposes, will be replaced by user input
+  method = "method3"
+
+  #Impute NaN value with temporary mean value of the column
+  filled_matrix = imputeByMean(tm2, matrix(0, cols = ncol(tm2), rows = 1))
+
+  if(method == "default"){
+    #METHOD 1
+    #Calculate the distance using dist method after imputation with mean
+    distance_matrix = dist(filled_matrix)
+
+    #Change 0 value so rowIndexMin will ignore that diagonal value
+    distance_matrix = replace(target = distance_matrix, pattern = 0, replacement = 999)
+
+    #Get the minimum distance row-wise computation
+    minimum_index = rowIndexMin(distance_matrix)
+
+    #Position of missing values in per row in which column
+    position = rowSums(is.nan(tm2))
+    position = position * minimum_index
+
+    #Filter the 0 out
+    I = (rowSums(is.nan(tm2))!=0)
+    missing = removeEmpty(target=position, margin="rows", select=I)
+
+    #Convert the value indices into 0/1 matrix to find location
+    indices = table(missing, seq(1,nrow(filled_matrix)),odim1=nrow(filled_matrix),odim2=nrow(missing))
+
+    #Replace the index with value
+    imputedValue = t(indices) %*% filled_matrix[,as.scalar(missing_col)]
+
+    #Get the index location of the missing value
+    x = rowSums(is.nan(tm2))
+    missing_indices = seq(1, nrow(x)) * x
+
+    #Put the replacement value in the missing indices
+    I2 = removeEmpty(target=missing_indices, margin="rows")
+    R = table(I2,1,imputedValue,odim1 = nrow(tm2), odim2=1)
+
+    #Replace the masked column with to be imputed Value
+    masked[,as.scalar(missing_col)] = masked[,as.scalar(missing_col)] * R
+
+    #Impute the value
+    result = replace(target = tm2, pattern = NaN, replacement = 0)
+    result = result + masked
+    print("Result method1")
+    print(toString(result))
+
+  } else if(method == "method2"){
+    #METHOD 2
+    #Split the matrix into containing NaN values (missing records) and not containing NaN values (M2 records)
+    I = (rowSums(is.nan(tm2))!=0)
+    missing = removeEmpty(target=filled_matrix, margin="rows", select=I)
+
+    Y = (rowSums(is.nan(tm2))==0)
+    M2 = removeEmpty(target=filled_matrix, margin = "rows", select = Y)
+
+    #Calculate the euclidean distance between fully records and missing records, and then find the min value row wise
+    D = -2 * (M2 %*% t(missing)) + t(rowSums (missing ^ 2));
+    minD = rowIndexMin(t(D))
+
+    #Convert the value indices into 0/1 matrix to find location
+    indices = table(minD, seq(1,nrow(M2)),odim1=nrow(M2),odim2=nrow(minD))
+
+    #Replace the value
+    imputedValue = t(indices) %*% M2[,as.scalar(missing_col)]
+
+    #Get the index location of the missing value
+    x = rowSums(is.nan(tm2))
+    missing_indices = seq(1, nrow(x)) * x
+
+    #Put the replacement value in the missing indices
+    I2 = removeEmpty(target=missing_indices, margin="rows")
+    R = table(I2,1,imputedValue,odim1 = nrow(tm2), odim2=1)
+
+    #Update the masked value
+    masked[,as.scalar(missing_col)] = masked[,as.scalar(missing_col)] * R
+
+    result = replace(target = tm2, pattern = NaN, replacement = 0)
+    result = result + masked
+
+    print("Result method2")
+    print(toString(result))
+
+  } else if(method == "method3"){
+    #METHOD 3
+    #Split the matrix into containing NaN values (missing records) and not containing NaN values (M2 records)
+    I = (rowSums(is.nan(tm2))!=0)
+    missing = removeEmpty(target=filled_matrix, margin="rows", select=I)
+
+    Y = (rowSums(is.nan(tm2))==0)
+    M3 = removeEmpty(target=filled_matrix, margin = "rows", select = Y)
+
+    #Create a random subset
+    random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 1, sparsity = 0.5, seed = 33))
+    #ensure that random_matrix has at least 1 value
+    if(as.scalar(colSums(random_matrix)) < 1) { random_matrix = matrix(1, rows = nrow(M3), cols = 1)}

Review Comment:
   Avoid such single-line if-then blocks.


