[Dhis2-devs] Matching strategies and algorithms

Knut Staring Wed, 05 May 2010 11:47:37 -0700

This ongoing (long) discussion on the OMRS list on name matching
strategies is also relevant for us - it is not only important for
patient, but also for matching facility names and villages/chiefdoms
etc with shapefiles for mapping.


Knut

---------- Forwarded message ----------
From: Jeff Rafter <[email protected]>
Date: Wed, May 5, 2010 at 5:34 PM
Subject: Re: [OPENMRS-DEV] soundex architecture question
To: [email protected]


> i think using using minimal edit distance from dictionary words  or names 
> would be a better implementation if correcting word spelling is the goal. 
> soundex needs to be fine tuned to bantu languages which no one has done 
> successively yet.

Check out the implementation from earlier in this thread... the
Mateme/Ruby bantu soundex is working great in Malawi and it looks like
Dave (and others?) have already ported this to Java for work in
OpenMRS.

>
> On Tue, May 4, 2010 at 11:10 PM, Paul Biondich <[email protected]> 
> wrote:
>>
>> Also, just another gentle reminder... there is another whole "approach" to 
>> matching mis-spelled or misrepresented names:  string sequence matching or 
>> comparator algorithms.
>> Some examples of this are:  Levenshtein Edit Distance (and edit distances in 
>> general), and Longest Common Substring (LCS), and the Jaro-Winkler 
>> methodology.
>> Wikipedia is your friend on these.

I think that Levenshtein distance would be tough for suggesting. Lots
of things would have the same distance but would be wildly different
right? I could be wrong on that though. I have used LCS with account
names in a previous life and it worked well, but mainly for things
like "Bank of America" and "Bank America". These aren't similar
sounding but are clearly related. I could see this having an impact on
names like the we discussed in the other thread: "Mike von der ohe Mc
Kay" versus "Mike McKay".
Secondarily, the person_name_code approach would be more difficult,
you would have to compare the typed in string to all of the names in
the database to determine the distance. We could probably use some
kind of Ur-name and mean distance but that would take some tweaking.
All the best
Jeff

>>
>> -Paul
>> On May 1, 2010, at 1:12 AM, David Thomas wrote:
>>
>> This is excellent.  I'm adding this to my notes.  As we move forward with 
>> kinyarwanda, it seems like an eventual logical step to try to abstract the 
>> lessons learned for other implementers working in non-european languages.  
>> Its on my 'things not to forget about' list.
>>
>> d
>>
>> On 4/28/2010 6:27 AM, Jeff Rafter wrote:
>>
>> Awesome work Dave. I really like that you can add algorithms.
>> In thinking about this more it would be cool if *anyone* could add a new 
>> algorithm. The process is pretty simple and could be accomplished with a 
>> couple tables:
>> soundex_algorithm
>>   soundex_algorithm_id:int
>>   name:string
>>   uppercase_all_letters:boolean, default:true
>>   strip_punctuation:boolean, default:true
>>   keep_first_digit:boolean, default:true
>>   remove_digit_pairs, default:true
>>   remove_leading_zero, default:true
>>   length:int, default:4
>> soundex_algorithm_replacement
>>   soundex_algorithm_step_id:int
>>   soundex_algorithm_id:int # refers to the main algorithm
>>   sort_weight:int
>>   match:string
>>   value:string
>>
>> This could obviously be massaged some more. And I am more or less imagining 
>> there would be regex in the match value which might be over the heads of 
>> non-devs (although the very simple regex in this case might be okay).
>> The other potential problem would be that you may need to specify at what 
>> point you keep the first char (in the Bantu one we have to replace leading 
>> phonemes first). I could imagine in a language where there were spaces / ' 
>> in names that you might consider the next char a leading char as well. So 
>> this could be a crazy road to go down. Just an idea.
>>
>> Great work,
>> Jeff
>> On Wed, Apr 28, 2010 at 7:18 AM, David Thomas <[email protected]> wrote:
>>>
>>> Hi.  The namephonetic module that i checked in a couple of days ago 
>>> basically mirrors everything that's said here.  The one small data model 
>>> change is that each row only stores one encoded name component, which is 
>>> then labeled by a java enum (GIVEN_NAME, MIDDLE_NAME, FAMILY_NAME, 
>>> FAMILY_NAME2).  The module AOP's around savePatient and calculates the code 
>>> for fields that you specify using a global property, meaning that you can 
>>> control what PersonName fields you want to encode, and you can even attach 
>>> them to different soundex algorithms (folks in rwanda get a local name and 
>>> a christian name, which are quite different).
>>>
>>> There's a little JSP page that gives you the ability to re-calculate all 
>>> soundexes for all patients in the db, for migration purposes, or if you 
>>> want to alter your soundex.
>>>
>>> And, i ported jeff's chichewa soundex directly into java, which is included 
>>> in the module -- we needed a starting point for kinyarwanda.
>>>
>>> Finally, the module allows you to add your own algorithm.  All algorithms 
>>> only have to be an extension of the StringEncoder class that lives in the 
>>> apache commons codec jar file.  The module also includes all the algorithms 
>>> that live in this jar file by default -- Soundex, RefinedSoundex, 
>>> Metaphone, Double Metaphone, etc...
>>>
>>> The module however, doesn't affect the default patient lookup in openmrs.  
>>> I didn't connect them.  But the module does have a bunch of methods for 
>>> finding a patient, which i'm using from a touchscreen registration module 
>>> i'm working on.
>>>
>>> dave thomas
>>>
>>> On 4/27/2010 8:19 PM, Jeff Rafter wrote:
>>>
>>> For completeness, here is the code (obviously very simple implementation):
>>> http://github.com/jeffrafter/mateme/blob/master/lib/bantu_soundex.rb
>>> Also for completeness "we" here meant Rodney Tayamika and myself (with lots 
>>> of help later from Evan). This was one of those
>>> situations that absolutely required someone with local knowledge.
>>> I also realized that in the final solution actually moved the soundexes out 
>>> of person_name into a nearly identical table called
>>> person_name_code. This was done so that the name codes would not show up in 
>>> other modules and in places that didn't
>>> make use of the preferred flag:
>>>  desc openmrs_1_5.person_name_code
>>>     -> ;
>>> +-------------------------+-------------+------+-----+---------+----------------+
>>> | Field                   | Type        | Null | Key | Default | Extra      
>>>     |
>>> +-------------------------+-------------+------+-----+---------+----------------+
>>> | person_name_code_id     | int(11)     | NO   | PRI | NULL    | 
>>> auto_increment |
>>> | person_name_id          | int(11)     | YES  | MUL | NULL    |            
>>>     |
>>> | given_name_code         | varchar(50) | YES  | MUL | NULL    |            
>>>     |
>>> | middle_name_code        | varchar(50) | YES  | MUL | NULL    |            
>>>     |
>>> | family_name_code        | varchar(50) | YES  | MUL | NULL    |            
>>>     |
>>> | family_name2_code       | varchar(50) | YES  |     | NULL    |            
>>>     |
>>> | family_name_suffix_code | varchar(50) | YES  |     | NULL    |            
>>>     |
>>> +-------------------------+-------------+------+-----+---------+----------------+
>>> 7 rows in set (1.16 sec)
>>> I am happy to fill in any more holes here, and am excited to see this 
>>> possibly being modularized.
>>> Cheers,
>>> Jeff
>>>
>>> On Tue, Apr 27, 2010 at 6:02 PM, Jeff Rafter <[email protected]> wrote:
>>>>
>>>> Hey All,
>>>> I am trying to catch up on this thread, somehow I missed it. We (PIH) have 
>>>> implemented Soundex in Mateme (the PIH/Baobab touchscreen system). When we 
>>>> did this re worked the phoneme set of traditional soundex to use Bantu 
>>>> phonemes instead (based on a very old missionary text in Malawi). This had 
>>>> the effect of grouping "l" and "r" for instance, a common cross-phoneme in 
>>>> Malawi. It works *great*. The code is very simple as well. Ultimately Evan 
>>>> Waters asked for a couple of adjustments based on real world clinic use 
>>>> after the first couple months.
>>>> Secondarily, the question of when to do this.
>>>> The best time to calculate the soundex is when you change the name. For us 
>>>> we had control over this in the touchscreen application, but embedding it 
>>>> into a module would be even better because then edits in OpenMRS proper 
>>>> would cause soundex refiguring. To save the soundex we simply add a second 
>>>> PersonName record with soundexes instead of names. This acts a giant 
>>>> soundex cache. When we search, we do a quick calculation on the soundex of 
>>>> the search string and search for the result in our names. Ultimately this 
>>>> makes name searching just as fast as before.
>>>> We had a back log of names so we had a quick one-time process that 
>>>> calculated all of the current names... from then on the system kept up 
>>>> with itself as it was just one additional thing to save whenever a name 
>>>> changed.
>>>> Jeff
>>>> On Wed, Apr 21, 2010 at 4:40 PM, Paul Biondich <[email protected]> 
>>>> wrote:
>>>>>
>>>>> Dave:
>>>>> The traditional criticism of Soundex for our work is that it's algorithm 
>>>>> was inferred from English names, so it's problematic when bringing it to 
>>>>> a Swahili or Rwandan French name space.
>>>>> I know that Shaun Grannis is very up to speed on the various pros and 
>>>>> cons of the phonetic "compression" algorithms, and I'll flag him to make 
>>>>> sure he sees this and responds.
>>>>> Least Common Substring might be better given the circumstances (it tends 
>>>>> to "hide" misspellings and typing transpositions, which are what you are 
>>>>> likely trying to deal with).
>>>>> RE: running these algorithms on-the-fly vs. as a scheduled task... I 
>>>>> think you need to determine first what mechanism(s) you want to use, and 
>>>>> then measure the performance implications appropriately.  In practice, 
>>>>> Shaun's work always pregenerates these...
>>>>> -Paul
>>>>> On Apr 21, 2010, at 11:04 PM, David Thomas wrote:
>>>>>
>>>>> Hi everyone.  We've started a discussion of whether or not to launch into 
>>>>> development of a soundex module, or if other strategies might be more 
>>>>> appropriate.    Here's the dialogue thus far:
>>>>>
>>>>>
>>>>> Me  (dave):
>>>>>
>>>>> Hey, quick question -- the soundex algorithm looks fairly processor 
>>>>> intensive, to the point where i don't think this should be done in 
>>>>> real-time for existing names in the database.  Do you pre-calculate 
>>>>> soundex values for existing patient names, so that the algorithm is just 
>>>>> applied to the search string, and then compared with the pre-calculated 
>>>>> values?
>>>>>
>>>>> I feel like maybe this should be an openmrs module on its own, with a 
>>>>> little data model of pre-calculated values for givenName, middleName,  
>>>>> familyName,  familyName2 and should support multiple implementations of 
>>>>> SoundexAlgorithm.
>>>>>
>>>>> In case you haven't read it (mike, darius), here's the theory:
>>>>> http://www.creativyst.com/Doc/Articles/SoundEx1/SoundEx1.htm
>>>>>
>>>>> Based on people's responses, i'll start (or not) the discussion of the 
>>>>> data model.
>>>>>
>>>>>
>>>>> Darius:
>>>>>
>>>>>
>>>>> Interesting. I'd never read the algorithm before. More research:
>>>>>
>>>>> Another competing algorithm is metaphone, which has apparently been 
>>>>> replaced by double metaphone. I don't know the specific algorithm, so I 
>>>>> don't know if it's easy to translate for Bantu pronounciations.
>>>>> Apache's commons-codec has implementations of various phonetic algorithms 
>>>>> (see API) including Soundex and RefinedSoundex (better for spellcheckers 
>>>>> apparently) that allow you to specify the mapping. (It also has metaphone 
>>>>> but no ability to change that algorithm.)
>>>>>
>>>>> I did a quick test of running the commons-codec soundex algorithm on all 
>>>>> the (few) names in the PersonServiceTest-extranames.xml file in OpenMRS 
>>>>> and got the following result:
>>>>>     Took 136 ms to do 6 person names with 16 components
>>>>>  Presumably the implication is that if you do 6000 names it will take 2+ 
>>>>> minutes on a laptop, which is obviously not fast enough for real-time.
>>>>> So, the real question is whether we should be using a phonetic matching 
>>>>> algorithm (in which case the module you describe is the only way to go) 
>>>>> or if we should be considering leveraging the existing Patient Matching 
>>>>> Module, which uses a much more sophisticated algorithm, probably has 
>>>>> considered a real-time-check use case, but may not be complete. (That 
>>>>> sounds like a question for the OpenMRS developers list. Anyone feel like 
>>>>> sending that?)
>>>>> -Darius
>>>>>
>>>>> ________________________________
>>>>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>>>>
>>>>> ________________________________
>>>>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>>
>>> ________________________________
>>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>>
>>> ________________________________
>>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>
>> ________________________________
>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>
>> ________________________________
>> Click here to unsubscribe from OpenMRS Developers' mailing list
>>
>> ________________________________
>> Click here to unsubscribe from OpenMRS Developers' mailing list
>
>
> --
> ========================================
> Mugisha Moses
> Lead Developer
> Appfrica Labs
> P.O. Box 1420 Kampala, Uganda
> http://appfrica.org
> skype name :  mossplix
> twitter: @mugisha
>
>
>
> ________________________________
> Click here to unsubscribe from OpenMRS Developers' mailing list

________________________________
Click here to unsubscribe from OpenMRS Developers' mailing list


-- 
Cheers,
Knut Staring

_______________________________________________
Mailing list: https://launchpad.net/~dhis2-devs
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~dhis2-devs
More help   : https://help.launchpad.net/ListHelp

[Dhis2-devs] Matching strategies and algorithms

Reply via email to