Github user kturgut commented on the issue:

    https://github.com/apache/spark/pull/17092
  
    @jkbradley @MLnick @sethah @Yunni  @merlintang @akatz  
    It seems LSH would be a perfect fit for matching patient records, if only I could figure out how to assign different weights to each column of the patient record I am comparing. For instance, each record may have zero to many identifiers. If the identifiers match exactly, we consider it a solid match. However, if the IDs do not strongly match, we also look at an additional set of fields, such as name, birthdate, and address, each with a different weight.
    For instance, an exact name match is stronger evidence than a match with small typos.
    To give each compared field a different weight, would I have to write a custom distance calculator?
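    One workaround I have been considering (just a sketch of the standard token-replication trick, not anything the Spark API provides; all names below are hypothetical) is to approximate per-field weights by repeating each field's tokens in proportion to its weight before MinHashing, since MinHash then estimates Jaccard similarity over the weighted token sets:

```python
import hashlib

def tokens_for_record(record, weights):
    """Expand each field's tokens in proportion to its weight.

    `record` maps field name -> set of string tokens; `weights` maps
    field name -> integer replication factor. Emitting a token w times
    (with distinct suffixes) makes that field count w times as much in
    the Jaccard similarity that MinHash approximates.
    """
    tokens = set()
    for field, values in record.items():
        w = weights.get(field, 1)
        for v in values:
            for i in range(w):
                tokens.add(f"{field}:{v}:{i}")
    return tokens

def minhash_signature(tokens, num_hashes=64):
    """MinHash signature: for each of num_hashes salted hash functions,
    keep the minimum hash value over all tokens."""
    return [
        min(int(hashlib.md5(f"{h}:{t}".encode()).hexdigest(), 16) for t in tokens)
        for h in range(num_hashes)
    ]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Hypothetical example: IDs weigh 4x, names 2x, birthdate 1x.
weights = {"id": 4, "name": 2, "birthdate": 1}
a = {"id": {"123"}, "name": {"ann", "smith"}, "birthdate": {"1980-01-02"}}
b = {"id": {"123"}, "name": {"anne", "smith"}, "birthdate": {"1980-01-02"}}
sim = estimated_jaccard(
    minhash_signature(tokens_for_record(a, weights)),
    minhash_signature(tokens_for_record(b, weights)),
)  # estimate of the weighted Jaccard similarity between a and b
```

    The appeal is that this stays entirely within set-based MinHash, so it should compose with banding/amplification afterwards; the cost is that weights are limited to small integer ratios.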
    Or should I instead do MinHashing and then LSH as a second step, as described in this document: http://web.stanford.edu/class/cs345a/slides/05-LSH.pdf?
    It does not look like AND-OR amplification would help with that, since it only takes the number of hash functions as input, and we do not seem to have any control over the sensitivity of the hash functions themselves.
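    For reference on the amplification point, the standard banding analysis (from the slides linked above, not Spark-specific): with b bands of r rows each, two records with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b, so b and r do give a knob on sensitivity even when the hash functions themselves are fixed:

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two records with Jaccard similarity s agree on
    at least one of `bands` bands, each an AND over `rows` MinHash values."""
    return 1.0 - (1.0 - s ** rows) ** bands

# The curve is S-shaped with its threshold near (1/bands)**(1/rows):
# dissimilar pairs are rarely candidates, similar pairs almost always are.
p_low = candidate_probability(0.3, bands=20, rows=5)   # rarely a candidate
p_high = candidate_probability(0.8, bands=20, rows=5)  # almost always a candidate
```

    Choosing b and r moves the threshold, which tunes sensitivity without touching the hash functions, though it does not by itself solve the per-field weighting question.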
    I would really appreciate your guidance.

