mboehm7 commented on code in PR #1888:
URL: https://github.com/apache/systemds/pull/1888#discussion_r1305940915
##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -42,14 +42,15 @@
# with M<<N and S<<N, but suboptimal imputation.
# seed Root seed value for random/sample calls for deterministic behavior
# -1 for true randomization
+# sparsity Size of the sample for 'dist_sample' method
Review Comment:
Please, call it sample_frac to indicate the sample fraction. We want to take
a sample of tuples from the input matrix, not create a sparse random matrix.
##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -76,30 +79,32 @@ m_imputeByKNN = function(Matrix[Double] X, String
method="dist", Int seed=-1)
#Get the minimum distance row-wise computation
minimum_index = rowIndexMin(distance_matrix)
- #Position of missing values in per row in which column
- position = rowSums(is.nan(X))
- position = position * minimum_index
+ #Loop through each column
+ parfor(i in 1:nrow(missing_col_index), check = 0){
+ #Position of missing values in per row in which column
+ position = masked[,as.scalar(missing_col_index[i,1])]
+ position = position * minimum_index
Review Comment:
For a record with multiple missing values, the top-1 nearest neighbor
(computed over all features) is the same for all columns with missing values.
Hence, we should be able to vectorize this imputation by (1) getting the index
of the nearest neighbor, (2) construct a permutation matrix (via table) based
on these indexes, (3) matrix multiply the permutation matrix with the data to
obtain the entire records A, and then (4) impute via X * (Mask==0) + A * Mask.
If you don't manage to do this vectorization, leave the parfor and I'll improve
it afterwards.
##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -42,14 +42,15 @@
# with M<<N and S<<N, but suboptimal imputation.
# seed Root seed value for random/sample calls for deterministic behavior
# -1 for true randomization
+# sparsity Size of the sample for 'dist_sample' method
#
------------------------------------------------------------------------------
#
# OUTPUT:
#
------------------------------------------------------------------------------
# result Imputed dataset
#
------------------------------------------------------------------------------
-m_imputeByKNN = function(Matrix[Double] X, String method="dist", Int seed=-1)
+m_imputeByKNN = function(Matrix[Double] X, String method="dist", Int
seed=-1,Int sparsity = 0.5)
return(Matrix[Double] result)
{
#TODO fix seed handling (only root seed)
Review Comment:
remove these TODOs once done.
##########
scripts/builtin/imputeByKNN.dml:
##########
@@ -143,7 +156,11 @@ m_imputeByKNN = function(Matrix[Double] X, String
method="dist", Int seed=-1)
M3 = removeEmpty(target=filled_matrix, margin = "rows", select = Y)
#Create a random subset
- random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max = 1,
sparsity = 0.5, seed = seed))
+ if(seed == -1){
+ random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max =
1, sparsity = sparsity))
+ } else {
+ random_matrix = ceiling(rand(rows = nrow(M3), cols = 1, min = 0, max =
1, sparsity = sparsity, seed = seed))
+ }
Review Comment:
I think we might have misunderstood us when talking about randomization: my
comment was to create a random matrix from the passed root seed and generate as
many seeds as you need for subsequent rand/sample calls. Right now we create a
random matrix here, but don't actually perform sampling of records. Please
select records by either calling sample and then creating a permutation matrix
(e.g., table(seq(), sample) %*% X) or via thesholding (e.g., rand()<sample_frac
followed by remove empty)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]