Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Thursday 03 December 2009 20:20:03 fe...@crowfix.com wrote: I have a project which requires normalizing names, and by that, I mean converting to lower case etc, whatever eliminates redundancies. I know Unicode has a different normalize meaning, but for my purposes, that has already been done. Maybe I should call it standardization or make up a new cromulent word. By which I really mean I am confused by a lot of advice I have gotten from USAians who get by with the good old 7 bit ASCII character set on a daily basis, whether it be written in Unicode or not. One of the puzzles to me is all the accented chars. Umlauts, etc. I am not trying to convert names for permanent purposes but for internal comparison. In Germany is a district Busingen, with an umlauted 'u'. Is it reasonable to consider it the same word whether with or without the umlauted u? French has the cedilla and acute and grave accents. Spanish has the tilde n. Scandinavian languages (all? some?) have the o with a slash. Or put another way, I don't know much about German, French, Spanish, etc keyboards. Do your keyboards have any of the extra keys, all of them? Are German keyboards and French and Spanish keyboards as restricted to their own languages as US keyboards are? If you have to hit two or three keys to keep the umlauts, accents, and tildes, do you get lazy sometimes and type the base character by itself? Is it even considered the base character, or is it considered lazy and sloppy, much as I get complaints about typing thru because through is too much trouble? I need something the equivalent of the C function strcasecmp() which not only ignores case, but all other differences without distinction, whatever they may be. If leaving off umlauts horrifies academics and purists but is what people do in the real world, I want to take that into consideration, so that if one person uses the umlaut and another doesn't, it won't generate two separate entries.
But if leaving off the umlaut or accent yields a distinct place name, then I can't do that -- but if real world people do that and live with the confusion, then I guess I have to make a different choice. Yes, I am something of an ignorant American. I know some Japanese, French, and Spanish, but not the details of everyday usage. I'd like to learn. Hi Felix, Apart from what was already mentioned, you might want to also consider the following: 1) Even though people tend to try to do it correctly, non-natives can still make mistakes with the names. These mistakes are frowned upon by the natives, but are a part of life. 2) Names of cities can change with time, example: New York used to be called Nieuw Amsterdam (or New Amsterdam in English) 3) Some cities have multiple valid spellings in the same language: The Hague = Den Haag or 's Gravenhage (yes, the apostrophe before the s is part of the second version of the name) An easier option might be to filter on the post-codes, these should be unique and if you put the country's international abbreviation in front of it, like so: NL-1234 AA or D-12345, you have a single field to check and then link the actual city-name to the postcode area. Disclaimer: I have no clue if these 2 postcodes actually exist -- Joost
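A rough sketch of the postcode idea, in Python. The function name postcode_key and the choice to squeeze out the space inside the postcode are assumptions made for illustration; the two sample postcodes are the made-up ones from the message above and may not exist.

```python
def postcode_key(country_code: str, postcode: str) -> str:
    """Build a single comparison key such as 'NL-1234AA' or 'D-12345'.

    Assumption: whitespace inside a postcode carries no meaning, so
    '1234 AA' and '1234AA' should compare equal.
    """
    compact = "".join(postcode.split()).upper()
    return f"{country_code.strip().upper()}-{compact}"


# Two spellings of the same (hypothetical) Dutch postcode give one key.
print(postcode_key("nl", "1234 aa"))
print(postcode_key("NL", "1234AA"))
```

The city name can then be stored against this key, so spelling variants of the name never take part in the comparison at all.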
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
I have a project which requires normalizing names, and by that, I mean converting to lower case etc, whatever eliminates redundancies. I know Unicode has a different normalize meaning, but for my purposes, that has already been done. Maybe I should call it standardization or make up a new cromulent word. By which I really mean I am confused by a lot of advice I have gotten from USAians who get by with the good old 7 bit ASCII character set on a daily basis, whether it be written in Unicode or not. Yes, I am something of an ignorant American. I know some Japanese, French, and Spanish, but not the details of everyday usage. I'd like to learn. Your project sounds interesting, but I have little to contribute on the technical side. I'm curious about your handling of Japanese, just because I'm living outside Tokyo these days. My grasp on Japanese is basically rubbish, but I can at least claim to know a thing or two. To keep this in line with your stated application, I actually wonder how you handle Tokyo. For pronunciation purposes, if you put it in hiragana and literally romanized it, you'd probably get Toukyou. In Japanese a double-vowel just extends the sound and isn't a diphthong (and usually o is extended by u and only rarely another o). For a lot of cases on the double 'oo' they'll Romanize the second 'o' as an 'h', since otherwise someone will pronounce it like (a) fool. So, take a family name Ohshiro. Probably it should be Romanized Oohshiro, but then people would say something like seeing fireworks. Tokyo is Romanized this way, according to one culture book I read, because everyone knows both the o's are extended! I'm sure all these people also know that kyo is a single syllable, too! So it's not To-key-oh it's just To-kyo where both syllables are extended from the double oo.
Osaka is also an extended O at the beginning as I recall, and Kyoto is the same case as Tokyo (incidentally, the Chinese characters for those two cities are the same and just reversed in order!). Again to speak to the original application, I don't know who types Tookyoo or Tohkyoh or Toukyou. Probably no one because it's generally Romanized as we all know it. But for typing purposes, Japanese type the pronunciation of words via hiragana and then a little list pops up and they select the word they want. So in this sense, they are typing Toukyou into the keyboard...just it's in hiragana. If you had any questions about Japanese things, I could ask a colleague. They are all happy to answer questions. Regards, daid
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Sun, Dec 06, 2009 at 10:58:59AM +0900, daid kahl wrote: I'm curious about your handling of Japanese, just because I'm living outside Tokyo these days. My grasp on Japanese is basically rubbish, but I can at least claim to know a thing or two. Our handling is simple -- we don't yet. I don't know how to handle things like that, or the previous example of Copenhagen in different languages. Look at Naples -- that's not what Italians call it. Venice is really bad -- no idea how English got it so mangled. Speaking of Japanese, their word for Mexico (last time I checked) was taken from the English MEKS-ih-ko and comes out as may-kee-shoo-ko rather than the more natural may-hee-ko if they had taken it straight from Spanish. As long as things stay in Unicode in the native language, we will do alright. It's covering accepted mistakes (I can't think of a better term) that is the problem -- thus my worries I'd have to include all the accented and unaccented versions. I learned enough Japanese to travel and ask bare-bones questions. As for romanization of Japanese, how do you even know which system to use? Just as Peking is now Beijing, I have seen Tokyo with bars over the Os and of course without. That is the same problem as Rome and Roma. As for ToKyo being two syllables ... I think it depends on how one defines syllables. Ask a Japanese to pronounce three (san) slowly, and it will be two syllables, sa-n, saw uhn. Ask for three hundred which comes out as sambyaku because the n syllable changes sound when it sounds better, and they will make quite a few syllables out of it, such as (I am guessing now) saw-umm-bee-yaw-koo. To write Tokyo in the proper furigana is probably something like toh-o-kee-yoh-o. Kyoto is the same case as Tokyo (incidentally, the Chinese characters for those two cities are the same and just reversed in order!). Nope -- Tokyo is 東京, east capital. Kyoto is 京都, capital city. Kyo is the same, to is different. -- ... _._. ._ ._. . _._. ._. ___ .__ ._. .
.__. ._ .. ._. Felix Finch: scarecrow repairman rocket surgeon / fe...@crowfix.com GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933 I've found a solution to Fermat's Last Theorem but I see I've run out of room o
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
Our handling is simple -- we don't yet. I don't know how to handle things like that, or the previous example of Copenhagen in different languages. Look at Naples -- that's not what Italians call it. Venice is really bad -- no idea how English got it so mangled. Speaking of Japanese, their word for Mexico (last time I checked) was taken from the English MEKS-ih-ko and comes out as may-kee-shoo-ko rather than the more natural may-hee-ko if they had taken it straight from Spanish. Yeah, you get all kinds of crazy. For a long time I couldn't understand why 'computer' is in katakana (ie: taken from English) and 'calculus' isn't. As it turns out, the Japanese invented calculus independent of Newton and Leibniz. As for ToKyo being two syllables ... I think it depends on how one defines syllables. Ask a Japanese to pronounce three (san) slowly, and it will be two syllables, sa-n, saw uhn. Ask for three hundred which comes out as sambyaku because the n syllable changes sound when it sounds better, and they will make quite a few syllables out of it, such as (I am guessing now) saw-umm-bee-yaw-koo. To write Tokyo in the proper furigana is probably something like toh-o-kee-yoh-o. Well, I don't think n is really a syllable. It's a sound, and it's the only part of the syllabary in Japanese that doesn't have a vowel. I'm not really convinced this is a syllable in reality. The proper way to write Tokyo for syllabary would be to-u-kyo-u I think, but I'm not certain. But really that's misleading because you're *not* supposed to pronounce the sounds twice, you just extend them, so they aren't really syllables either, they are just modifiers. Kyoto is the same case as Tokyo (incidentally, the Chinese characters for those two cities are the same and just reversed in order!). Nope -- Tokyo is 東京, east capital. Kyoto is 京都, capital city. Kyo is the same, to is different. Huh. I wonder how the hell I came up with that?
I'm convinced I did not decide that on my own but that someone told me. And they told me I'm sure because I remember the story that went with it. Very strange. But you're absolutely right. Regards, daid
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
such as (I am guessing now) saw-umm-bee-yaw-koo. To write Tokyo in the proper furigana is probably something like toh-o-kee-yoh-o. Oh, I should mention that this is correct in writing. But the yo is a subscript, so it's also a modifier, so the ki part isn't pronounced, it's modified into a different sound... Good times. ~daid
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Sun, Dec 06, 2009 at 11:45:43AM +0900, daid kahl wrote: Well, I don't think n is really a syllable. It's a sound, and it's the only part of the syllabary in Japanese that doesn't have a vowel. I'm not really convinced this is a syllable in reality. It's certainly a syllable in their syllabaries, and their opinion is all that counts ... it is *their* language ... The proper way to write Tokyo for syllabary would be to-u-kyo-u I No, they don't have kyo in the syllabaries. The furigana I have seen say that is ki-yo, two syllables. Now I may be full of it, as most of what I learned was 30 years ago, and I never got beyond reading and writing at a third or fourth grade level. I imagine Japanese readers of this are snickering at the crazy foreigners. -- ... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._. Felix Finch: scarecrow repairman rocket surgeon / fe...@crowfix.com GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933 I've found a solution to Fermat's Last Theorem but I see I've run out of room o
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
Hey! So do people type in Busingen different ways depending on how they feel, do some people always leave off the umlaut, do some always use it? You cannot simply leave the umlaut out since it is considered a separate letter in itself. You cannot choose whether to write an ö or an o. Like Renat said, there are words that completely change their meaning when exchanging the characters. I think this is especially true when it comes to names. While people get used to spellings like Goettingen for Göttingen, it looks odd and wrong to Germans. Like someone who doesn't have the character on the keyboard ;) Also keep in mind that there are cities that are spelled with oe or ae by design. (Soest, Oelde, Aerzen, Oestinghausen etc.) Those cannot be spelled with an ö instead. It would simply be wrong. Patrick
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Fri, Dec 04, 2009 at 10:17:30AM +0100, Patrick Holthaus wrote: You cannot simply leave the umlaut out since it is considered a separate letter in itself. You cannot choose whether to write an ö or an o. Like Renat said, there are words that completely change their meaning when exchanging the characters. I think this is especially true when it comes to names. While people get used to spellings like Goettingen for Göttingen, it looks odd and wrong to Germans. Like someone who doesn't have the character on the keyboard ;) Also keep in mind that there are cities that are spelled with oe or ae by design. (Soest, Oelde, Aerzen, Oestinghausen etc.) Those cannot be spelled with an ö instead. It would simply be wrong. OK, that settles it :-) It seems the message you folks are trying to pound into my head is that people don't just casually drop the umlauts and accents. That's what was bugging me -- if it is an extra key or weird combinations like in emacs, maybe people would skip it often enough that we would have to allow for that. This is a better answer than I had feared because now I don't have to sweat weird transliterations. There may still be some, but probably not enough to worry about. Now on to other mysteries, like why our (American) customer thinks people in French Guayana (sp?) are going to write French Guayana for the country name. Even my thick skull doesn't expect people living in Deutschland (probably spelled wrong too, it is very late in a long and tiring day, so apologies in advance, and if it is correct, apologies for not recognizing that :-) to write Germany ... Thanks for not pounding my head too heavily ... -- ... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._. Felix Finch: scarecrow repairman rocket surgeon / fe...@crowfix.com GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933 I've found a solution to Fermat's Last Theorem but I see I've run out of room o
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Freitag 04 Dezember 2009, fe...@crowfix.com wrote: If enough Europeans are in the habit of taking shortcuts and skipping umlauts and accents and cedilla and tildes, we don't. Because skipping umlauts, accents etc. creates a completely new word. Probably one that is already there. Munster is a town. Münster/Muenster is a town. You can not drop it. Ever. Only the idiots at cnn and foxnews drop it.
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Friday 04 December 2009 15:42:56 Volker Armin Hemmann wrote: On Freitag 04 Dezember 2009, fe...@crowfix.com wrote: If enough Europeans are in the habit of taking shortcuts and skipping umlauts and accents and cedilla and tildes, we don't. Because skipping umlauts, accents etc. creates a completely new word. Probably one that is already there. Munster is a town. Münster/Muenster is a town. You can not drop it. Ever. Only the idiots at cnn and foxnews drop it. And there's always a counterexample: In Afrikaans (major ZA language derived from Dutch) there are two accent characters: - deelteken: two dots above certain vowels - kappie: caret-like symbol above certain vowels In all cases, these characters are used simply as modifiers of the vowel. They change inflection or the pronunciation of the vowel, not the letter itself. Sometimes they just make spelling easier: The verb modifier to indicate past tense is the prefix ge and the verb for eat is eet. So ate translates to geeet. Three consecutive e's looks weird so this word is written geëet -- alan dot mckinnon at gmail dot com
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Fri, 4 Dec 2009 22:50:52 +0200, Alan McKinnon wrote: Three consecutive e's looks weird Are you calling my laptop weird? ;-) -- Neil Bothwick THE BORG: Calm, Cool and Collective...
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
Hi! On Thu, 3 Dec 2009 11:20:03 -0800 fe...@crowfix.com wrote: In Germany is a district Busingen, with an umlauted 'u'. Is it reasonable to consider it the same word whether with or without the umlauted u? No. For many words it would be ok, but not for all. For example, drucken means to print, drücken (with an umlaut) means to press. In German you can exchange an umlaut with the combination base letter + e, i.e. ü -- ue, ö -- oe, and ß -- ss. There are words with the combination oe that in that particular case does not mean ö. So it's not straightforward, especially with names. Those may have a rather odd spelling for historical reasons. Or put another way, I don't know much about German, French, Spanish, etc keyboards. Do your keyboards have any of the extra keys, all of them? Are German keyboards and French and Spanish keyboards as restricted to their own languages as US keyboards are? If you have to hit two or three keys to keep the umlauts, accents, and tildes, do you get lazy sometimes and type the base character by itself? Is it even considered the base character, or is it considered lazy and sloppy, much as I get complaints about typing thru because through is too much trouble? German keyboards have keys for all umlauts and 'ß'. You can google for pictures of different keyboard layouts. I need something the equivalent of the C function strcasecmp() which not only ignores case, but all other differences without distinction, whatever they may be. I'd suggest you use a Unicode library. BTW, what about Cyrillic letters or other alphabets? Those may have nothing to do with ASCII. Or is your project restricted to Latin letters? Cheers, Renat -- Probleme kann man niemals mit derselben Denkweise loesen, durch die sie entstanden sind. (Einstein)
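The exchange rule described above (ü to ue, ö to oe, ß to ss) could be sketched in Python roughly as follows. The table and the function name transliterate_german are illustrative assumptions, and the ä and upper-case rows are added by analogy. The mapping only goes one way: nothing here tries to turn oe back into ö, because names like Soest are spelled with oe by design.

```python
# One-way German transliteration table: umlaut -> base letter + e, ß -> ss.
GERMAN_TO_ASCII = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})


def transliterate_german(name: str) -> str:
    """Replace German umlauts and ß with their conventional ASCII spellings."""
    return name.translate(GERMAN_TO_ASCII)


print(transliterate_german("Göttingen"))  # Goettingen
print(transliterate_german("drücken"))    # druecken, still distinct from drucken
print(transliterate_german("Soest"))      # Soest, unchanged
```

Comparing the transliterated forms keeps drucken and drücken apart, which simply deleting the dots would not.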
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Thu, Dec 03, 2009 at 08:50:08PM +0100, Renat Golubchyk wrote: I'd suggest you use a unicode library. BTW, what about cyrillic letters or other alphabets? Those may have nothing to do with ASCII. Or is your project restricted to latin letters? The data is already in normalized Unicode. My problem is eliminating errors from near misses :-( Cyrillic doesn't look like the same problem -- no accents that I can see. Chinese, Japanese, etc, same as far as I know. Arabic has lots of tricks on combining letters and leaving out vowels, so it is probably an entirely different problem. One thing I did not make clear is that this is for place names only, like cities and whatever the equivalent of a US state or Canadian province is, such as Busingen. So do people type in Busingen different ways depending on how they feel, do some people always leave off the umlaut, do some always use it? My biggest annoyance is that a lot of the google results come from Americans full of theory about languages they only know from the W3C recommendations. Maybe email or real documents follow proper usage much more closely than addresses on a web form, but I don't care about them. Maybe web forms in Germany, where they want a district, do as many web sites do in English and have a menu of possible districts, in which case no one types in umlauts anyway :-) -- ... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._. Felix Finch: scarecrow repairman rocket surgeon / fe...@crowfix.com GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933 I've found a solution to Fermat's Last Theorem but I see I've run out of room o
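For reference, the accent-blind, case-blind comparison asked about here (a strcasecmp() that also ignores umlauts and accents) might look roughly like this in Python, using only the standard unicodedata module. The names fold and same_ignoring_accents are made up for this sketch. As the replies in this thread point out, treating ü and u as equal is wrong for German, so this shows what the operation is, not a recommendation to use it.

```python
import unicodedata


def fold(text: str) -> str:
    """Decompose, drop combining marks (accents, umlaut dots), then casefold."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.casefold()


def same_ignoring_accents(a: str, b: str) -> bool:
    """True if a and b differ only in case and combining marks."""
    return fold(a) == fold(b)


print(same_ignoring_accents("Göttingen", "GOTTINGEN"))  # True
print(same_ignoring_accents("drücken", "drucken"))      # True, although the words differ
print(fold("København"))  # ø has no decomposition, so it survives
```

Letters such as ø are not a base letter plus a combining mark in Unicode, so this approach leaves them untouched; handling them would need an explicit table.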
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Thu, 3 Dec 2009 12:07:26 -0800 fe...@crowfix.com wrote: So do people type in Busingen different ways depending on how they feel, do some people always leave off the umlaut, do some always use it? If you want to leave off the umlaut you have to be absolutely sure that there exists no other place with the spelling without umlaut. Otherwise you may have collisions sometimes. -- Probleme kann man niemals mit derselben Denkweise loesen, durch die sie entstanden sind. (Einstein)
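One way to check for such collisions up front, sketched in Python: fold every known place name and report any folded key that more than one distinct name maps to. The helper names strip_marks and find_collisions and the sample list are assumptions for illustration only.

```python
import unicodedata
from collections import defaultdict


def strip_marks(name: str) -> str:
    """Drop combining marks (umlaut dots, accents) and ignore case."""
    decomposed = unicodedata.normalize("NFKD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def find_collisions(names):
    """Map each folded key to the distinct original names that share it."""
    groups = defaultdict(set)
    for name in names:
        groups[strip_marks(name)].add(name)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


places = ["Münster", "Munster", "Soest", "Göttingen"]
print(find_collisions(places))  # {'munster': ['Munster', 'Münster']}
```

An empty result for a given data set would mean dropping the marks is at least not ambiguous within that data set; it says nothing about names that are not in it.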
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Donnerstag 03 Dezember 2009, Renat Golubchyk wrote: Hi! On Thu, 3 Dec 2009 11:20:03 -0800 fe...@crowfix.com wrote: In Germany is a district Busingen, with an umlauted 'u'. Is it reasonable to consider it the same word whether with or without the umlauted u? No. For many words it would be ok, but not for all. For example, drucken means to print, drücken (with an umlaut) means to press. In German you can exchange an umlaut with the combination base letter + e, i.e. ü -- ue, ö -- oe, and ß -- ss. There are words with the combination oe that in that particular case does not mean ö. So it's not straightforward, especially with names. Those may have a rather odd spelling for historical reasons. and it is hilarious to see american media fuck that up almost every time ... ;)
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Friday 04 December 2009 00:07:33 Volker Armin Hemmann wrote: On Donnerstag 03 Dezember 2009, Renat Golubchyk wrote: Hi! On Thu, 3 Dec 2009 11:20:03 -0800 fe...@crowfix.com wrote: In Germany is a district Busingen, with an umlauted 'u'. Is it reasonable to consider it the same word whether with or without the umlauted u? No. For many words it would be ok, but not for all. For example, drucken means to print, drücken (with an umlaut) means to press. In German you can exchange an umlaut with the combination base letter + e, i.e. ü -- ue, ö -- oe, and ß -- ss. There are words with the combination oe that in that particular case does not mean ö. So it's not straightforward, especially with names. Those may have a rather odd spelling for historical reasons. and it is hilarious to see american media fuck that up almost every time ... ;) What's even funnier is hearing news readers on the South African public broadcaster try to pronounce regular *English* words... -- alan dot mckinnon at gmail dot com
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Thu, Dec 3, 2009 at 6:29 PM, Renat Golubchyk ragerm...@gmx.net wrote: On Thu, 3 Dec 2009 12:07:26 -0800 fe...@crowfix.com wrote: So do people type in Busingen different ways depending on how they feel, do some people always leave off the umlaut, do some always use it? If you want to leave off the umlaut you have to be absolutely sure that there exists no other place with the spelling without umlaut. Otherwise you may have collisions sometimes. What about a set of dictionaries? And also a library for mistyped word search? Francisco
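As a small illustration of the "dictionary plus mistyped word search" idea, Python's standard difflib module can suggest near matches from a list of known names. This is only a sketch: the function name suggest, the 0.8 cutoff, and the three-name dictionary are assumptions, and a real place-name list would be far larger.

```python
import difflib


def suggest(name, known_names, cutoff=0.8):
    """Return known names whose case-folded spelling is close to the input."""
    by_folded = {known.casefold(): known for known in known_names}
    hits = difflib.get_close_matches(
        name.casefold(), list(by_folded), n=3, cutoff=cutoff
    )
    return [by_folded[hit] for hit in hits]


known = ["Göttingen", "Soest", "Oelde"]
print(suggest("Gottingen", known))  # ['Göttingen']
print(suggest("Tokyo", known))      # []
```

Because it returns candidates instead of silently rewriting the input, a collision such as Munster versus Münster would show up as two suggestions, not as a wrong merge.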
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On 12/3/09, fe...@crowfix.com fe...@crowfix.com wrote: I have a project which requires normalizing names, and by that, I mean converting to lower case etc, whatever eliminates redundancies. I assume you have already removed the language problem from the equation? I.e., the fact that København, Copenhague, Kööpenhamina and Copenhagen all mean the same place, just in different European languages (Danish, Spanish, Finnish and English, in that order). If you have input in multiple languages then it is not just about umlauts or no umlauts ... -- Arttu V.
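The multi-language problem is a lookup-table problem more than a character problem. A minimal Python sketch, assuming the native-language name is chosen as the canonical one; the table ALIASES and the function canonical_place are hypothetical, and the four spellings are the ones listed in the message above.

```python
import unicodedata


def _key(name: str) -> str:
    """Normalize to NFC, trim, and casefold so lookups ignore case and encoding form."""
    return unicodedata.normalize("NFC", name.strip()).casefold()


# Every known spelling maps to one canonical name (here: the Danish one).
ALIASES = {
    _key(alias): "København"
    for alias in ("København", "Copenhague", "Kööpenhamina", "Copenhagen")
}


def canonical_place(name: str):
    """Return the canonical name for a known alias, or None if unknown."""
    return ALIASES.get(_key(name))


print(canonical_place("Copenhague"))      # København
print(canonical_place(" KÖÖPENHAMINA "))  # København
print(canonical_place("Atlantis"))        # None
```

No amount of accent stripping would connect Kööpenhamina to Copenhagen; only an explicit table of translations can.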
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Thu, Dec 03, 2009 at 08:32:45PM -0200, Francisco Ares wrote: What about a set of dictionaries? And also a library for mistyped word search? Way too much effort for this. Nice idea, might even be fun, but it's just trying to avoid the common things, and I mainly wondered about how often people whose keyboards have accents etc skip them if they have the chance and what the repercussions would be. We have people already who enter Brooklyn for their city instead of New York, which might even be technically correct, but it doesn't match anything else and we don't correct it because it doesn't happen often enough. There may be people in Louisiana who use the original French spelling for all I know; we don't handle that either. Or ditto for Spanish names in the southwest that might have a tilde. -- ... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._. Felix Finch: scarecrow repairman rocket surgeon / fe...@crowfix.com GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933 I've found a solution to Fermat's Last Theorem but I see I've run out of room o
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Fri, Dec 04, 2009 at 12:38:34AM +0200, Arttu V. wrote: I assume you have already removed the language problem from the equation? I.e., the fact that København, Copenhague, Kööpenhamina and Copenhagen all mean the same place, just in different European languages (Danish, Spanish, Finnish and English, in that order). We're trying to go by the name in the native language, but that might not be possible, in which case I guess we'll have to get all the possible translations. It certainly is messy. If you have input in multiple languages then it is not just about umlauts or no umlauts ... I've certainly learned a bit of that ... so far we only have to deal with country codes (no problem) and districts (can't be too many, I hope). -- ... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._. Felix Finch: scarecrow repairman rocket surgeon / fe...@crowfix.com GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933 I've found a solution to Fermat's Last Theorem but I see I've run out of room o
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
look at my name, ok? Just dropping the Umlaut is wrong. No if, but, maybe. It is wrong. Error. Mistake. Fail. If you can not enter ä, ö or ü, you must transform them to ae, oe or ue.
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Friday 04 December 2009 02:03:23 Volker Armin Hemmann wrote: look at my name, ok? Just dropping the Umlaut is wrong. No if, but, maybe. It is wrong. Error. Mistake. Fail. If you can not enter ä, ö or ü, you must transform them to ae, oe or ue. Your name shows here in 7-bit ASCII: Volker Armin Hemmann volkerar...@googlemail.com typo? -- alan dot mckinnon at gmail dot com
Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long
On Fri, Dec 04, 2009 at 01:03:23AM +0100, Volker Armin Hemmann wrote: look at my name, ok? Just dropping the Umlaut is wrong. No if, but, maybe. It is wrong. Error. Mistake. Fail. If you can not enter ä, ö or ü, you must transform them to ae, oe or ue. I'd like to find a program which would do that! Seriously. But anyway, the purpose of this is not to transform names so our antique ASCII-7 computers can store them, but to eliminate redundant records. For instance, we get data from vendors for all cities and states, geolocation data, which has its own redundancies, such as both FORT WORTH and FT WORTH, or SAINT LOUIS and ST LOUIS. But we have to convert to upper case, get rid of punctuation, get rid of extra white space, etc, and all that is independent of the locale. I want to do the same for Unicode. If enough Europeans are in the habit of taking shortcuts and skipping umlauts and accents and cedilla and tildes, then I'd like to standardize the data for lookup. This has nothing to do with converting people's names for storage. We don't even store the transformed place name. -- ... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._. Felix Finch: scarecrow repairman rocket surgeon / fe...@crowfix.com GPG = E987 4493 C860 246C 3B1E 6477 7838 76E9 182E 8151 ITAR license #4933 I've found a solution to Fermat's Last Theorem but I see I've run out of room o
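The locale-independent steps listed above (upper case, drop punctuation, collapse white space, unify FT/FORT and ST/SAINT) could be sketched in Python like this. The function name lookup_key and the two-entry abbreviation table are assumptions for illustration; in particular, expanding every ST to SAINT is only safe for place names, not for street addresses. The key is meant for lookup only and is never stored, as described in the message.

```python
import re

# Hypothetical abbreviation table; a real one would come from the vendor data.
ABBREVIATIONS = {"FT": "FORT", "ST": "SAINT"}


def lookup_key(place: str) -> str:
    """Build a comparison key: upper case, no punctuation, single spaces."""
    upper = place.upper()
    # Replace anything that is neither a word character nor white space.
    no_punct = re.sub(r"[^\w\s]", " ", upper)
    # split() with no arguments also collapses runs of white space.
    tokens = no_punct.split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


print(lookup_key("Ft. Worth"))     # FORT WORTH
print(lookup_key("  st   louis"))  # SAINT LOUIS
print(lookup_key("SAINT LOUIS"))   # SAINT LOUIS
```

In Python 3 the `\w` class is Unicode-aware for text patterns, so accented letters pass through this step untouched; whether to fold them afterwards is the separate question this thread is about.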