On Thursday 03 December 2009 20:20:03 [email protected] wrote:
> I have a project which requires normalizing names, and by that, I mean
> converting to lower case etc, whatever eliminates redundancies. I
> know Unicode has a different "normalize" meaning, but for my purposes,
> that has already been done. Maybe I should call it standardization or
> make up a new cromulent word.
>
> By which I really mean I am confused by a lot of advice I have gotten
> from USAians who get by with the good old 7 bit ASCII character set on
> a daily basis, whether it be written in Unicode or not.
>
> One of the puzzles to me is all the accented chars. Umlauts, etc. I
> am not trying to convert names for permanent purposes but for internal
> comparison. In Germany is a district "Busingen", with an umlauted
> 'u'. Is it reasonable to consider it the same word whether with or
> without the umlauted u? French has the cedilla and acute and grave
> accents. Spanish has the tilde n. Scandinavian languages (all?
> some?) have the o with a slash.
>
> Or put another way, I don't know much about German, French, Spanish,
> etc keyboards. Do your keyboards have any of the extra keys, all of
> them? Are German keyboards and French and Spanish keyboards as
> restricted to their own languages as US keyboards are? If you have to
> hit two or three keys to keep the umlauts, accents, and tildes, do you
> get lazy sometimes and type the base character by itself? Is it even
> considered the base character, or is it considered lazy and sloppy,
> much as I get complaints about typing "thru" because "through" is too
> much trouble?
>
> I need something the equivalent of the C function strcasecmp() which
> not only ignores case, but all other differences without distinction,
> whatever they may be. If leaving off umlauts horrifies academics and
> purists but is what people do in the real world, I want to take that
> into consideration, so that if one person uses the umlaut and another
> doesn't, it won't generate two separate entries. But if leaving off
> the umlaut or accent is a distinct place name, then I can't do that --
> but if real world people do that and live with the confusion, then I
> guess I have to make a different choice.
>
> Yes, I am something of an ignorant American. I know some Japanese,
> French, and Spanish, but not the details of everyday usage. I'd like
> to learn.
>

Hi Felix,

Apart from what was already mentioned, you might also want to consider the
following:

1) Even though people generally try to get names right, non-natives still
   make mistakes with them. Natives frown upon these mistakes, but they are
   a part of life.
2) Names of cities can change over time. Example: "New York" used to be
   called "Nieuw Amsterdam" (or "New Amsterdam" in English).
3) Some cities have multiple valid spellings in the same language:
   The Hague = "Den Haag" or "'s-Gravenhage" (yes, the apostrophe before
   the "s" is part of the second version of the name).

An easier option might be to filter on the postcodes. These should be unique
within a country, and if you put the country's international abbreviation in
front, like so: "NL-1234 AA" or "D-12345", you have a single field to check.
You can then link the actual city name to the postcode area.

Disclaimer: I have no clue whether these two postcodes actually exist.

--
Joost
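
A minimal sketch of the postcode idea above, in Python. Everything in it is an assumption made for illustration: the function names postcode_key and group_names_by_postcode are hypothetical, the sample postcodes are the made-up ones from this message, and it assumes that whitespace and letter case inside a postcode carry no meaning (so "NL-1234 AA" is stored as "NL-1234AA"). It only shows how differently spelled city names could hang off one key; it does not validate postcodes.

```python
from collections import defaultdict


def postcode_key(country: str, postcode: str) -> str:
    """Build one comparison key, e.g. ("nl", "1234 aa") -> "NL-1234AA".

    Assumption: whitespace and letter case inside a postcode are not significant.
    """
    compact = "".join(postcode.split()).upper()
    return f"{country.strip().upper()}-{compact}"


def group_names_by_postcode(records):
    """Map each postcode key to the set of city-name spellings seen for it."""
    names_by_key = defaultdict(set)
    for country, postcode, name in records:
        names_by_key[postcode_key(country, postcode)].add(name)
    return dict(names_by_key)


# Made-up sample records; the postcodes are placeholders, not real ones.
records = [
    ("NL", "1234 AA", "Den Haag"),
    ("nl", "1234aa", "'s-Gravenhage"),
    ("D", "12345", "Büsingen"),
    ("d", "12345", "Busingen"),
]

grouped = group_names_by_postcode(records)
for key, names in sorted(grouped.items()):
    print(key, sorted(names))
```

With this layout the name spellings never have to be compared with each other: two entries count as the same place when their keys match, and every spelling seen for that key is kept alongside it.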

