[GitHub] [systemds] mboehm7 commented on a diff in pull request #1888: [SYSTEMDS-3153] Imputation for all missing columns and seed handling

via GitHub Fri, 25 Aug 2023 10:38:20 -0700


mboehm7 commented on code in PR #1888:
URL: https://github.com/apache/systemds/pull/1888#discussion_r1305940915



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -42,14 +42,15 @@
 #                              with M<<N and S<<N, but suboptimal imputation.
 # seed       Root seed value for random/sample calls for deterministic behavior
 #            -1 for true randomization
+# sparsity   Size of the sample for 'dist_sample' method

Review Comment:
   Please, call it sample_frac to indicate the sample fraction. We want to take 
a sample of tuples from the input matrix, not create a sparse random matrix.



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -76,30 +79,32 @@ m_imputeByKNN = function(Matrix[Double] X, String 
method="dist", Int seed=-1)
     #Get the minimum distance row-wise computation
     minimum_index = rowIndexMin(distance_matrix)
 
-    #Position of missing values in per row in which column
-    position = rowSums(is.nan(X))
-    position = position * minimum_index
+    #Loop through each column
+    parfor(i in 1:nrow(missing_col_index), check = 0){
+      #Position of missing values in per row in which column
+      position = masked[,as.scalar(missing_col_index[i,1])]
+      position = position * minimum_index

Review Comment:
   For a record with multiple missing values, the top-1 nearest neighbor 
(computed over all features) is the same for all columns with missing values. 
Hence, we should be able to vectorize this imputation by (1) getting the index 
of the nearest neighbor, (2) construct a permutation matrix (via table) based 
on these indexes, (3) matrix multiply the permutation matrix with the data to 
obtain the entire records A, and then (4) impute via X * (Mask==0) + A * Mask. 
If you don't manage to do this vectorization, leave the parfor and I'll improve 
it afterwards. 



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -42,14 +42,15 @@
 #                              with M<<N and S<<N, but suboptimal imputation.
 # seed       Root seed value for random/sample calls for deterministic behavior
 #            -1 for true randomization
+# sparsity   Size of the sample for 'dist_sample' method
 # 
------------------------------------------------------------------------------
 #
 # OUTPUT:
 # 
------------------------------------------------------------------------------
 # result     Imputed dataset
 # 
------------------------------------------------------------------------------
 
-m_imputeByKNN = function(Matrix[Double] X, String method="dist", Int seed=-1)
+m_imputeByKNN = function(Matrix[Double] X, String method="dist", Int 
seed=-1,Int sparsity = 0.5)
   return(Matrix[Double] result)
 {
   #TODO fix seed handling (only root seed)

Review Comment:
   remove these TODOs once done.



##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -143,7 +156,11 @@ m_imputeByKNN = function(Matrix[Double] X, String 
method="dist", Int seed=-1)
     M3 = removeEmpty(target=filled_matrix, margin = "rows", select = Y)
 
     #Create a random subset
-    random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 1, 
sparsity = 0.5, seed = seed))
+    if(seed == -1){
+      random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 
1, sparsity = sparsity))
+    } else {
+      random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 
1, sparsity = sparsity, seed = seed))
+    }

Review Comment:
   I think we might have misunderstood us when talking about randomization: my 
comment was to create a random matrix from the passed root seed and generate as 
many seeds as you need for subsequent rand/sample calls. Right now we create a 
random matrix here, but don't actually perform sampling of records. Please 
select records by either calling sample and then creating a permutation matrix 
(e.g., table(seq(), sample) %*% X) or via thesholding (e.g., rand()<sample_frac 
followed by remove empty)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [systemds] mboehm7 commented on a diff in pull request #1888: [SYSTEMDS-3153] Imputation for all missing columns and seed handling

Reply via email to