>>>> 2013/06/03 21:43 +0000, Rick James >>>>
Soundex is the 'right' approach, but it needs improvement.  So, find an 
improvement, then do something like this...
<<<<<<<<
Hashing involves somekind normalizing, and in my case I see no means to it; 
otherwise I would not have considered something so costly. On the other hand, 
maybe ....

I am comparing lists of place-names, and I want to match, say, any of "Mount 
Saint Francis" or "MT ST FRANCIS" or "MOUNT ST FRANCIS" or "MT SAINT 
FRANCIS"--but it is not all standard abbreviations. Sometimes there is 
"Galvestn" or "Galvston" or "Galvstn" for "Galveston", and it is not always 
vowel-letter deletion, either: "Ft Benj Harrison", "FT BENJAMIN HARRISON",  "Ft 
Benj Harsn"; "CLVR MIL ACAD", "Culver Milt Acad".

Anyhow, I gave up on a perfect solution, and instead added to each name the 
name padded with '%'s. On joining the longer name is used, but instead of the 
shorter the padded is used after "LIKE", if "LOCATE" also fails to match, and 
overall the Levenstein edit distance is used only for a check, with 
short-circuit "AND" and "OR" supposed (and the timing is such that I believe it 
is):

ON (LOCATE(Bookk.Burgh, PO.Burgh) > 0 OR LOCATE(PO.Burgh, Bookk.Burgh) > 0 OR 
CHAR_LENGTH(Bookk.Burgh) > CHAR_LENGTH(PO.Burgh) AND Bookk.Burgh LIKE PO.pBurgh 
OR CHAR_LENGTH(Bookk.Burgh) < CHAR_LENGTH(PO.Burgh) AND PO.Burgh LIKE 
Bookk.pBurgh)
AND mismatch(Bookk.Burgh, PO.Burgh, 1, 2, 1) < 8 IS NOT FALSE

It does not match "MOUNT ST FRANCIS" and "MT SAINT FRANCIS".
At least for LOCATE and LIKE there are linear-time algorithms.

All along I assumed that in the end some of the mismatching will be handled by 
hand. It is not that big a list, but doing all by hand is far too much. 


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql

Reply via email to