Shafaq-Siddiqi commented on a change in pull request #1139:
URL: https://github.com/apache/systemds/pull/1139#discussion_r551332965



##########
File path: scripts/builtin/mdedup.dml
##########
@@ -0,0 +1,114 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#------------------------------------------------------------------------------------------------------------------
+
+# Implements builtin for deduplication using matching dependencies (like Street 0.95, City 0.90 -> ZIP 1.0)
+# and Jaccard distance.
+# 
+# INPUT PARAMETERS:
+# -----------------------------------------------------------------------------------------------------------------
+# NAME            TYPE              DEFAULT     MEANING
+# -----------------------------------------------------------------------------------------------------------------
+# X               Frame               --       Input Frame X
+# LHSfeatures     Matrix[Integer]     --       A matrix 1xd with numbers of columns for MDs
+#                                              (like Street 0.95, City 0.90 -> ZIP 1.0)
+# LHSthreshold    Matrix[Double]      --       A matrix 1xd with threshold values in interval [0, 1] for MDs
+# RHSfeatures     Matrix[Integer]     --       A matrix 1xd with numbers of columns for MDs
+# RHSthreshold    Matrix[Double]      --       A matrix 1xd with threshold values in interval [0, 1] for MDs
+# verbose         Boolean             --       To print the output
+# -----------------------------------------------------------------------------------------------------------------
+#
+# Output(s)
+# -----------------------------------------------------------------------------------------------------------------
+# NAME                 TYPE         DEFAULT     MEANING
+# -----------------------------------------------------------------------------------------------------------------
+# MD              Matrix[Double]      ---       Matrix nx1 of duplicates
+
+s_mdedup = function(Frame[String] X, Matrix[Double] LHSfeatures, Matrix[Double] LHSthreshold,
+    Matrix[Double] RHSfeatures, Matrix[Double] RHSthreshold, Boolean verbose)
+  return(Matrix[Double] MD)
+{
+  n = nrow(X)
+  d = ncol(X)
+
+  if (0 > (ncol(LHSfeatures) + ncol(RHSfeatures)) > d)
+    stop("Invalid input: thresholds should in interval [0, " + d + "]")
+
+  if ((ncol(LHSfeatures) != ncol(LHSthreshold)) | (ncol(RHSfeatures) != ncol(RHSthreshold)))
+      stop("Invalid input: number of thresholds and columns to compare should be equal for LHS and RHS.")
+
+  if (max(LHSfeatures) > d | max(RHSfeatures) > d)
+    stop("Invalid input: feature values should be less than " + d)
+
+  if (sum(LHSthreshold > 1) > 0 | sum(RHSthreshold > 1) > 0)
+    stop("Invalid input: threshold values should be in the interval [0, 1].")
+
+  MD = matrix(0, n, 1)
+
+  LHS_MD = getMDAdjacency(X, LHSfeatures, LHSthreshold)
+
+  if (sum(LHS_MD) > 0) {
+    RHS_MD = getMDAdjacency(X, RHSfeatures, RHSthreshold)
+  }
+
+  MD = detectDuplicates(LHS_MD, LHS_MD)
+
+  if(verbose)
+    print(toString(MD))
+}
+
+getMDAdjacency = function(Frame[String] X, Matrix[Double] features, Matrix[Double] thresholds)
+  return(Matrix[Double] adjacency)
+{
+  n = nrow(X)
+  d = ncol(X)
+  adjacency = matrix(0, n, n)
+
+  for(i in 1 : ncol(features)) {
+    # slice col
+    pos = as.scalar(features[1, i])
+    Xi = X[, pos]
+
+    # distances between words in each row of col
+    dist = map(Xi, "(x, y) -> UtilFunctions.jaccardSim(x, y)")
+    jaccardDist = as.matrix(dist)
+    threshold = as.scalar(thresholds[1, i])
+#print(toString(jaccardDist))
+    if(i == 1) {
+      adjacency = jaccardDist >= threshold
+    } else {
+      adjacency = adjacency & (jaccardDist >= threshold)
+    }
+#print(toString(adjacency))
+    # break if one of MDs is false
+    if (sum(adjacency) == 0)

Review comment:
       Explicitly setting the value of the for-loop variable will not break out of the for loop here. If you want to terminate the loop early, please use a while loop instead.
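
A minimal sketch of what the while-loop variant could look like, assuming the loop body stays as in the diff above. The flag name `keepGoing` is hypothetical and not part of the PR; the `map(...)` call with `UtilFunctions.jaccardSim` is copied from the quoted code as-is.

```dml
getMDAdjacency = function(Frame[String] X, Matrix[Double] features, Matrix[Double] thresholds)
  return(Matrix[Double] adjacency)
{
  n = nrow(X)
  adjacency = matrix(0, n, n)

  i = 1
  keepGoing = TRUE   # hypothetical flag: set to FALSE to leave the loop early
  while (keepGoing & i <= ncol(features)) {
    # slice col
    pos = as.scalar(features[1, i])
    Xi = X[, pos]

    # pairwise similarities between the rows of the column (as in the quoted diff)
    dist = map(Xi, "(x, y) -> UtilFunctions.jaccardSim(x, y)")
    jaccardDist = as.matrix(dist)
    threshold = as.scalar(thresholds[1, i])

    if (i == 1) {
      adjacency = jaccardDist >= threshold
    } else {
      adjacency = adjacency & (jaccardDist >= threshold)
    }

    # no pair satisfies all MDs checked so far, so later columns cannot add matches
    if (sum(adjacency) == 0) {
      keepGoing = FALSE
    }
    i = i + 1
  }
}
```

The loop condition is re-evaluated on every iteration, so clearing the flag ends the loop after the current pass without touching the counter.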




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

