>>>> 2013/06/03 21:43 +0000, Rick James >>>> Soundex is the 'right' approach, but it needs improvement. So, find an improvement, then do something like this... <<<<<<<< Hashing involves somekind normalizing, and in my case I see no means to it; otherwise I would not have considered something so costly. On the other hand, maybe ....
I am comparing lists of place-names, and I want to match, say, any of "Mount Saint Francis" or "MT ST FRANCIS" or "MOUNT ST FRANCIS" or "MT SAINT FRANCIS"--but it is not all standard abbreviations. Sometimes there is "Galvestn" or "Galvston" or "Galvstn" for "Galveston", and it is not always vowel-letter deletion, either: "Ft Benj Harrison", "FT BENJAMIN HARRISON", "Ft Benj Harsn"; "CLVR MIL ACAD", "Culver Milt Acad". Anyhow, I gave up on a perfect solution, and instead added to each name the name padded with '%'s. On joining the longer name is used, but instead of the shorter the padded is used after "LIKE", if "LOCATE" also fails to match, and overall the Levenstein edit distance is used only for a check, with short-circuit "AND" and "OR" supposed (and the timing is such that I believe it is): ON (LOCATE(Bookk.Burgh, PO.Burgh) > 0 OR LOCATE(PO.Burgh, Bookk.Burgh) > 0 OR CHAR_LENGTH(Bookk.Burgh) > CHAR_LENGTH(PO.Burgh) AND Bookk.Burgh LIKE PO.pBurgh OR CHAR_LENGTH(Bookk.Burgh) < CHAR_LENGTH(PO.Burgh) AND PO.Burgh LIKE Bookk.pBurgh) AND mismatch(Bookk.Burgh, PO.Burgh, 1, 2, 1) < 8 IS NOT FALSE It does not match "MOUNT ST FRANCIS" and "MT SAINT FRANCIS". At least for LOCATE and LIKE there are linear-time algorithms. All along I assumed that in the end some of the mismatching will be handled by hand. It is not that big a list, but doing all by hand is far too much. -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/mysql