Shafaq-Siddiqi commented on pull request #1139:
URL: https://github.com/apache/systemds/pull/1139#issuecomment-753533113


   > This PR adds a new built-in mdedup for detecting duplicates in frames 
using matching dependencies (like Street 0.95, City 0.90 -> ZIP 1.0).
   > @Shafaq-Siddiqi For simplicity, I used Jaccard similarity, but I found out 
that Levenshtein or Jaro distance could also be used. Should I add them as well? 
To compute the Jaccard similarity between rows (strings) of an n x 1 vector, a 
map with two arguments was added: 
dist = map(Xi, "(x, y) -> UtilFunctions.jaccardSim(x, y)").
   > 
   > I also modified the discoverFD built-in by setting diag to 1.
   
   Thanks @OlgaOvcharenko. For now it is fine to have Jaccard similarity only; 
we will come back to other methods when we extend the overall implementation. 
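
For readers unfamiliar with the measure, here is a minimal Python sketch of a pairwise Jaccard similarity over the rows of a string column. It is an illustration only, not the SystemDS code: the function names `jaccard_sim` and `pairwise_jaccard` are hypothetical, and comparing strings as sets of characters (rather than tokens or n-grams) is an assumption that may differ from what UtilFunctions.jaccardSim does.

```python
from typing import List


def jaccard_sim(x: str, y: str) -> float:
    """Jaccard similarity |A & B| / |A | B| over the character sets of x and y."""
    a, b = set(x), set(y)
    union = a | b
    if not union:
        # Assumption: two empty strings are treated as identical.
        return 1.0
    return len(a & b) / len(union)


def pairwise_jaccard(rows: List[str]) -> List[List[float]]:
    """Apply the two-argument similarity to every pair of rows (n x n result)."""
    return [[jaccard_sim(x, y) for y in rows] for x in rows]


if __name__ == "__main__":
    streets = ["Main St", "Main Street", "Oak Ave"]
    for row in pairwise_jaccard(streets):
        print([round(v, 2) for v in row])
```

The result is symmetric with ones on the diagonal, so a threshold such as the 0.95 in the Street example could be applied to the off-diagonal entries to flag candidate duplicates.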


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

