Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-15 Thread J. Roeleveld
On Thursday 03 December 2009 20:20:03 fe...@crowfix.com wrote:
 I have a project which requires normalizing names, and by that, I mean
 converting to lower case etc, whatever eliminates redundancies.  I
 know Unicode has a different normalize meaning, but for my purposes,
 that has already been done.  Maybe I should call it standardization or
 make up a new cromulent word.
 
 By which I really mean I am confused by a lot of advice I have gotten
 from USAians who get by with the good old 7 bit ASCII character set on
 a daily basis, whether it be written in Unicode or not.
 
 One of the puzzles to me is all the accented chars.  Umlauts, etc.  I
 am not trying to convert names for permanent purposes but for internal
 comparison.  In Germany is a district Busingen, with an umlauted
 'u'.  Is it reasonable to consider it the same word whether with or
 without the umlauted u?  French has the cedilla and acute and grave
 accents.  Spanish has the tilde n.  Scandinavian languages (all?
 some?) have the o with a slash.
 
 Or put another way, I don't know much about German, French, Spanish,
 etc keyboards.  Do your keyboards have any of the extra keys, all of
 them?  Are German keyboards and French and Spanish keyboards as
 restricted to their own languages as US keyboards are?  If you have to
 hit two or three keys to keep the umlauts, accents, and tildes, do you
 get lazy sometimes and type the base character by itself?  Is it even
 considered the base character, or is it considered lazy and sloppy,
 much as I get complaints about typing "thru" because "through" is too
 much trouble?
 
 I need something the equivalent of the C function strcasecmp() which
 not only ignores case, but all other differences without distinction,
 whatever they may be.  If leaving off umlauts horrifies academics and
 purists but is what people do in the real world, I want to take that
 into consideration, so that if one person uses the umlaut and another
 doesn't, it won't generate two separate entries.  But if leaving off
 the umlaut or accent is a distinct place name, then I can't do that --
 but if real world people do that and live with the confusion, then I
 guess I have to make a different choice.
 
 Yes, I am something of an ignorant American.  I know some Japanese,
 French, and Spanish, but not the details of everyday usage.  I'd like
 to learn.
 
Hi Felix,

Apart from what was already mentioned, you might want to also consider the 
following:

1) Even though people tend to try to do it correctly, non-natives can still 
make mistakes with the names. These mistakes are frowned upon by the natives, 
but are a part of life.

2) Names of cities can change with time, for example:
New York used to be called Nieuw Amsterdam (or New Amsterdam in English)

3) Some cities have multiple valid spellings in the same language:
The Hague = Den Haag or 's-Gravenhage (yes, the apostrophe before the s
is part of the second version of the name)

An easier option might be to filter on the postcodes. These should be unique,
and if you put the country's international abbreviation in front, like so:
NL-1234 AA or D-12345, you have a single field to check and can then link
the actual city name to the postcode area.

Disclaimer: I have no clue if these 2 postcodes actually exist
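
A minimal sketch of that postcode idea, in Python. The key format, the
function names and the sample table are all assumptions made up for
illustration, and the postcode-to-city pairing is invented, just like the
postcodes themselves:

```python
import re

def postcode_key(country_code: str, postcode: str) -> str:
    """Build one comparison key, e.g. 'NL-1234AA' (hypothetical format)."""
    country = country_code.strip().upper()
    # Drop spaces and punctuation so '1234 AA' and '1234aa' compare equal.
    code = re.sub(r"[^0-9A-Za-z]", "", postcode).upper()
    return f"{country}-{code}"

# Invented sample data: the postcode and its city are not real assignments.
PLACES = {postcode_key("NL", "1234 AA"): "Den Haag"}

def place_for(country_code: str, postcode: str):
    """Return the stored city name for a postcode, or None if unknown."""
    return PLACES.get(postcode_key(country_code, postcode))
```

With a table like this the spelling of the city name never has to be
compared at all; only the key does.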

--
Joost



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-05 Thread daid kahl
 I have a project which requires normalizing names, and by that, I mean
 converting to lower case etc, whatever eliminates redundancies.  I
 know Unicode has a different normalize meaning, but for my purposes,
 that has already been done.  Maybe I should call it standardization or
 make up a new cromulent word.

 By which I really mean I am confused by a lot of advice I have gotten
 from USAians who get by with the good old 7 bit ASCII character set on
 a daily basis, whether it be written in Unicode or not.

 Yes, I am something of an ignorant American.  I know some Japanese,
 French, and Spanish, but not the details of everyday usage.  I'd like
 to learn.

Your project sounds interesting, but I have little to contribute on
the technical side.

I'm curious about your handling of Japanese, just because I'm living
outside Tokyo these days.  My grasp on Japanese is basically rubbish,
but I can at least claim to know a thing or two.

To keep this in line with your stated application, I actually wonder
how you handle Tokyo.  For pronunciation purposes, if you put it in
hiragana and literally romanized it, you'd probably get Toukyou.  In
Japanese a double-vowel just extends the sound and isn't a diphthong
(and usually o is extended by u and only rarely another o).  For a lot
of cases on the double 'oo' they'll Romanize the second 'o' as an 'h',
since otherwise someone will pronounce it like "(a) fool".  So, take
a family name Ohshiro.  Probably it should be Romanized Oohshiro,
but then people would say something like seeing fireworks.

Tokyo is Romanized this way, according to one culture book I read,
because everyone knows both the o's are extended!  I'm sure all these
people also know that kyo is a single syllable, too!  So it's not
To-key-oh it's just To-kyo where both syllables are extended from
the double oo.

Osaka is also an extended O at the beginning as I recall, and Kyoto is
the same case as Tokyo (incidentally, the Chinese characters for those
two cities are the same and just reversed in order!).

Again to speak to the original application, I don't know who types
Tookyoo or Tohkyoh or Toukyou.  Probably no one because it's generally
Romanized as we all know it.  But for typing purposes, Japanese type
the pronunciation of words via hiragana and then a little list pops up
and they select the word they want.  So in this sense, they are typing
Toukyou into the keyboard...just it's in hiragana.
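
If the romanized variants ever do need to compare equal, a rough Python
sketch could fold the long-o spellings together.  This is only an
assumption about how one might do it, the function name is made up, and
the rule is deliberately crude: it also merges names that differ only in
vowel length.

```python
import re
import unicodedata

def fold_long_o(name: str) -> str:
    """Collapse 'ou', 'oo', 'oh' and o-with-macron to a plain 'o'."""
    decomposed = unicodedata.normalize("NFD", name.casefold())
    # Remove combining marks, so an o with a macron becomes a bare o.
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # 'oh' is only folded when the h does not start a new syllable (h + vowel).
    return re.sub(r"o(?:u|o|h(?![aeiouy]))", "o", plain)
```

Under this rule Toukyou, Tohkyoh, Tookyoo and the macron spelling all
fold to the same string as plain Tokyo.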

If you had any questions about Japanese things, I could ask a
colleague.  They are all happy to answer questions.

Regards,
daid



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-05 Thread felix
On Sun, Dec 06, 2009 at 10:58:59AM +0900, daid kahl wrote:

 I'm curious about your handling of Japanese, just because I'm living
 outside Tokyo these days.  My grasp on Japanese is basically rubbish,
 but I can at least claim to know a thing or two.

Our handling is simple -- we don't yet.  I don't know how to handle
things like that, or the previous example of Copenhagen in different
languages.  Look at Naples -- that's not what Italians call it.  Venice
is really bad -- no idea how English got it so mangled.  Speaking of
Japanese, their word for Mexico (last time I checked) was taken from
the English MEKS-ih-ko and comes out as may-kee-shoo-ko rather than
the more natural may-hee-ko if they had taken it straight from
Spanish.

As long as things stay in Unicode in the native language, we will do
all right.  It's covering accepted mistakes (I can't think of a better
term) that is the problem -- thus my worries I'd have to include all
the accented and unaccented versions.

I learned enough Japanese to travel and ask bare-bones questions.

As for romanization of Japanese, how do you even know which system to
use?  Just as Peking is now Beijing, I have seen Tokyo with bars over
the Os and of course without.  That is the same problem as Rome and
Roma.

As for ToKyo being two syllables ... I think it depends on how one
defines syllables.  Ask a Japanese speaker to pronounce three (san) slowly, and
it will be two syllables, sa-n, saw uhn.  Ask for three hundred which
comes out as sambyaku because the n syllable changes sound when it
sounds better, and they will make quite a few syllables out of it,
such as (I am guessing now) saw-umm-bee-yaw-koo.  To write Tokyo in
the proper furigana is probably something like toh-o-kee-yoh-o.

 Kyoto is the same case as Tokyo (incidentally, the Chinese
 characters for those two cities are the same and just reversed in
 order!).

Nope -- Tokyo is 東京, east capital.  Kyoto is 京都, capital city.  Kyo
is the same, to is different.

-- 
... _._. ._ ._. . _._. ._. ___ .__ ._. . .__. ._ .. ._.
 Felix Finch: scarecrow repairman  rocket surgeon / fe...@crowfix.com
  GPG = E987 4493 C860 246C 3B1E  6477 7838 76E9 182E 8151 ITAR license #4933
I've found a solution to Fermat's Last Theorem but I see I've run out of room o



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-05 Thread daid kahl
 Our handling is simple -- we don't yet.  I don't know how to handle
 things like that, or the previous example of Copenhagen in different
 languages.  Look at Naples -- that's not what Italians call it.  Venice
 is really bad -- no idea how English got it so mangled.  Speaking of
 Japanese, their word for Mexico (last time I checked) was taken from
 the English MEKS-ih-ko and comes out as may-kee-shoo-ko rather than
 the more natural may-hee-ko if they had taken it straight from
 Spanish.

Yeah, you get all kinds of crazy.  For a long time I couldn't
understand why 'computer' is in katakana (ie: taken from English) and
'calculus' isn't.  As it turns out, the Japanese invented calculus
independently of Newton and Leibniz.

 As for ToKyo being two syllables ... I think it depends on how one
 defines syllables.  Ask a Japanese speaker to pronounce three (san) slowly, and
 it will be two syllables, sa-n, saw uhn.  Ask for three hundred which
 comes out as sambyaku because the n syllable changes sound when it
 sounds better, and they will make quite a few syllables out of it,
 such as (I am guessing now) saw-umm-bee-yaw-koo.  To write Tokyo in
 the proper furigana is probably something like toh-o-kee-yoh-o.

Well, I don't think n is really a syllable.  It's a sound, and it's
the only part of the syllabary in Japanese that doesn't have a vowel.
I'm not really convinced this is a syllable in reality.

The proper way to write Tokyo for syllabary would be to-u-kyo-u I
think, but I'm not certain.  But really that's misleading because
you're *not* supposed to pronounce the sounds twice, you just extend
them, so they aren't really syllables either, they are just modifiers.


 Kyoto is the same case as Tokyo (incidentally, the Chinese
 characters for those two cities are the same and just reversed in
 order!).

 Nope -- Tokyo is 東京, east capital.  Kyoto is 京都, capital city.  Kyo
 is the same, to is different.

Huh.  I wonder how the hell I came up with that?  I'm convinced I did
not decide that on my own but that someone told me.  And they told me
I'm sure because I remember the story that went with it.  Very
strange.  But you're absolutely right.

Regards,
daid



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-05 Thread daid kahl
 such as (I am guessing now) saw-umm-bee-yaw-koo.  To write Tokyo in
 the proper furigana is probably something like toh-o-kee-yoh-o.

Oh, I should mention that this is correct in writing.  But the yo is
written small, like a subscript, so it's also a modifier: the ki part isn't
pronounced on its own, it's modified into a different sound...

Good times.

~daid



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-05 Thread felix
On Sun, Dec 06, 2009 at 11:45:43AM +0900, daid kahl wrote:
 Well, I don't think n is really a syllable.  It's a sound, and it's
 the only part of the syllabary in Japanese that doesn't have a vowel.
 I'm not really convinced this is a syllable in reality.

It's certainly a syllable in their syllabaries, and their opinion is
all that counts ... it is *their* language ...

 The proper way to write Tokyo for syllabary would be to-u-kyo-u I

No, they don't have kyo in the syllabaries.  The furigana I have seen
say that is ki-yo, two syllables.

Now I may be full of it, as most of what I learned was 30 years ago,
and I never got beyond reading and writing at a third or fourth grade
level.  I imagine Japanese readers of this are snickering at the crazy
foreigners.




Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-04 Thread Patrick Holthaus
Hey!

 So do people type in Busingen different ways depending on how they
 feel, do some people always leave off the umlaut, do some always use
 it? 

You cannot simply leave the umlaut out, since it is considered a separate
letter in its own right. You cannot choose whether to write an ö or an o. Like
Renat said, there are words that completely change their meaning when
the characters are exchanged.

I think this is especially true when it comes to names. While people get used to
spellings like Goettingen for Göttingen, it looks odd and wrong to Germans.
Like someone who doesn't have the character on the keyboard ;)
Also keep in mind that there are cities that are spelled with oe or ae by 
design. (Soest, Oelde, Aerzen, Oestinghausen etc.) Those cannot be spelled 
with an ö instead. It would simply be wrong.
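
One way to respect that asymmetry in a lookup, sketched in Python under
the assumption that a list of correctly spelled names is available: build
the oe/ae/ue variants from the known names, and never turn a typed "oe"
back into "ö".  The names build_index and lookup are made up for this
sketch.

```python
# Forward mapping only: umlaut -> digraph, never the reverse.
UMLAUT_DIGRAPHS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def build_index(canonical_names):
    """Map each accepted spelling (case-folded) to its canonical name."""
    index = {}
    for name in canonical_names:
        folded = name.casefold()
        index[folded] = name
        # Also accept the digraph spelling, e.g. 'goettingen' for 'Göttingen'.
        index.setdefault(folded.translate(UMLAUT_DIGRAPHS), name)
    return index

INDEX = build_index(["Göttingen", "Soest", "Oelde"])

def lookup(user_input: str):
    """Return the canonical name, or None if the spelling is not accepted."""
    return INDEX.get(user_input.strip().casefold())
```

Here "Goettingen" finds Göttingen, while "Söst" and "Gottingen" find
nothing, because no variant of that shape was ever generated.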

Patrick




Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-04 Thread felix
On Fri, Dec 04, 2009 at 10:17:30AM +0100, Patrick Holthaus wrote:

 You cannot simply leave the umlaut out since it is considered as a separate
 letter for itself. You cannot choose whether to write an ö or an o. Like
 Renat said, there are words that completely change their meaning when
 exchanging the characters.
 
 I think this is especially true if it comes to names. While people get used to
 spellings like Goettingen for Göttingen, it looks odd and wrong to Germans.
 Like someone who doesn't have the character on the keyboard ;)
 Also keep in mind that there are cities that are spelled with oe or ae by
 design. (Soest, Oelde, Aerzen, Oestinghausen etc.) Those cannot be spelled
 with an ö instead. It would simply be wrong.

OK, that settles it :-)

It seems the message you folks are trying to pound into my head is
that people don't just casually drop the umlauts and accents.  That's
what was bugging me -- if it is an extra key or weird combinations
like in emacs, maybe people would skip it often enough that we would
have to allow for that.

This is a better answer than I had feared because now I don't have to
sweat weird transliterations.  There may still be some, but probably
not enough to worry about.

Now on to other mysteries, like why our (American) customer thinks
people in French Guiana are going to write "French Guiana" for
the country name.  Even my thick skull doesn't expect people living in
Deutschland (probably spelled wrong too, it is very late in a long and
tiring day, so apologies in advance, and if it is correct, apologies
for not recognizing that :-) to write "Germany" ...


Thanks for not pounding my head too heavily ...




Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-04 Thread Volker Armin Hemmann
On Freitag 04 Dezember 2009, fe...@crowfix.com wrote:
  If enough Europeans are in the habit of taking
 shortcuts and skipping umlauts and accents and cedilla and tildes,

we don't. Because skipping umlauts, accents & co. creates a completely new word. 
Probably one that is already there.

Munster is a town.

Münster/Muenster is a different town.

You can not drop it. Ever. Only the idiots at cnn and foxnews drop it.



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-04 Thread Alan McKinnon
On Friday 04 December 2009 15:42:56 Volker Armin Hemmann wrote:
 On Freitag 04 Dezember 2009, fe...@crowfix.com wrote:
   If enough Europeans are in the habit of taking
  shortcuts and skipping umlauts and accents and cedilla and tildes,
 
 we don't. Because skipping umlauts, accents & co. creates a completely new word.
 Probably one that is already there.
 
 Munster is a town.
 
 Münster/Muenster is a different town.
 
 You can not drop it. Ever. Only the idiots at cnn and foxnews drop it.
 

And there's always a contrary:

In Afrikaans (major ZA language derived from Dutch) there are two accent 
characters:

- deelteken: two dots above certain vowels
- kappie: caret-like symbol above certain vowels

In all cases, these characters are used simply as modifiers of the vowel. They 
change the inflection or pronunciation of the vowel, not the letter itself. 
Sometimes they just make spelling easier:

The verb modifier to indicate past tense is the prefix "ge" and the verb for 
"eat" is "eet". So "ate" translates to "geeet". Three consecutive e's looks 
weird, so this word is written "geëet".

-- 
alan dot mckinnon at gmail dot com



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-04 Thread Neil Bothwick
On Fri, 4 Dec 2009 22:50:52 +0200, Alan McKinnon wrote:

 Three consecutive e's looks weird

Are you calling my laptop weird? 

;-)


-- 
Neil Bothwick

THE BORG: Calm, Cool and Collective...




Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread Renat Golubchyk
Hi!

On Thu, 3 Dec 2009 11:20:03 -0800
fe...@crowfix.com wrote:
 In Germany is a district Busingen, with an umlauted 'u'.  Is it
 reasonable to consider it the same word whether with or without the
 umlauted u?

No. For many words it would be OK, but not for all. For example,
"drucken" means "to print", while "drücken" (with an umlaut) means "to
press". In German you can replace an umlaut with the combination base
letter + e, i.e. ü -> ue, ö -> oe, and likewise ß -> ss. There are words
in which the combination oe does not stand for ö in that particular
case. So it's not straightforward, especially with names. Those may
have a rather odd spelling for historical reasons.

 Or put another way, I don't know much about German, French, Spanish,
 etc keyboards.  Do your keyboards have any of the extra keys, all of
 them?  Are German keyboards and French and Spanish keyboards as
 restricted to their own languages as US keyboards are?  If you have to
 hit two or three keys to keep the umlauts, accents, and tildes, do you
 get lazy sometimes and type the base character by itself?  Is it even
 considered the base character, or is it considered lazy and sloppy,
 much as I get complaints about typing thru because through is too
 much trouble?

German keyboards have keys for all umlauts and 'ß'. You can google for
pictures of different keyboard layouts.

 I need something the equivalent of the C function strcasecmp() which
 not only ignores case, but all other differences without distinction,
 whatever they may be.

I'd suggest you use a Unicode library. BTW, what about Cyrillic
letters or other alphabets? Those may have nothing to do with ASCII. Or
is your project restricted to Latin letters?
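
As a rough illustration of what a Unicode library offers here, this is a
sketch using Python's standard unicodedata module.  The names loose_key
and loose_equal are made up.  It is the "ignore everything" comparison
that was asked for, so it is lossy on purpose, and it cannot tell
drucken from drücken.

```python
import unicodedata

def loose_key(text: str) -> str:
    """Case-fold, decompose, and drop combining marks (accents, umlauts)."""
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def loose_equal(a: str, b: str) -> bool:
    """A strcasecmp()-like test that also ignores diacritics."""
    return loose_key(a) == loose_key(b)
```

Letters that Unicode does not decompose, such as the Scandinavian ø, pass
through unchanged, so they would still need a mapping of their own.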


Cheers,
Renat

-- 
Probleme kann man niemals mit derselben Denkweise loesen,
durch die sie entstanden sind.
  (Einstein)




Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread felix
On Thu, Dec 03, 2009 at 08:50:08PM +0100, Renat Golubchyk wrote:

 I'd suggest you use a unicode library. BTW, what about cyrillic
 letters or other alphabets? Those may have nothing to do with ASCII. Or
 is your project restricted to latin letters?

The data is already in normalized Unicode.  My problem is eliminating
errors from near misses :-( Cyrillic doesn't look like the same
problem -- no accents that I can see.  Chinese, Japanese, etc, same as
far as I know.  Arabic has lots of tricks on combining letters and
leaving out vowels, so it is probably an entirely different problem.

One thing I did not make clear is that this is for place names only,
like cities and whatever the equivalent of a US state or Canadian
province is, such as Busingen.

So do people type in Busingen different ways depending on how they
feel, do some people always leave off the umlaut, do some always use
it?  My biggest annoyance is that a lot of the google results come
from Americans full of theory about languages they only know from the
W3C recommendations.  Maybe email or real documents follow proper
usage much more closely than addresses on a web form, but I don't care
about them.  Maybe web forms in Germany, where they want a district,
do as many web sites do in English and have a menu of possible
districts, in which case no one types in umlauts anyway :-)




Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread Renat Golubchyk
On Thu, 3 Dec 2009 12:07:26 -0800
fe...@crowfix.com wrote:
 So do people type in Busingen different ways depending on how they
 feel, do some people always leave off the umlaut, do some always use
 it?

If you want to leave off the umlaut, you have to be absolutely sure that
no other place exists with the spelling without the umlaut. Otherwise
you may get collisions.
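
That check can be run over the whole list of known names before deciding
anything.  A sketch in Python, assuming the names are available as a
plain list (strip_marks and find_collisions are made-up names):

```python
import unicodedata
from collections import defaultdict

def strip_marks(name: str) -> str:
    """Case-fold a name and remove its combining marks."""
    decomposed = unicodedata.normalize("NFD", name.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def find_collisions(names):
    """Return {stripped key: [distinct names]} for keys shared by 2+ names."""
    groups = defaultdict(set)
    for name in names:
        groups[strip_marks(name)].add(name)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}
```

An empty result means dropping the marks is safe for that particular
list; any entry is a pair of places that would be merged by mistake.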


-- 
Probleme kann man niemals mit derselben Denkweise loesen,
durch die sie entstanden sind.
  (Einstein)




Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread Volker Armin Hemmann
On Donnerstag 03 Dezember 2009, Renat Golubchyk wrote:
 Hi!
 
 On Thu, 3 Dec 2009 11:20:03 -0800
 
 fe...@crowfix.com wrote:
  In Germany is a district Busingen, with an umlauted 'u'.  Is it
  reasonable to consider it the same word whether with or without the
  umlauted u?
 
 No. For many words it would be OK, but not for all. For example,
 "drucken" means "to print", while "drücken" (with an umlaut) means "to
 press". In German you can replace an umlaut with the combination base
 letter + e, i.e. ü -> ue, ö -> oe, and likewise ß -> ss. There are words
 in which the combination oe does not stand for ö in that particular
 case. So it's not straightforward, especially with names. Those may
 have a rather odd spelling for historical reasons.

and it is hilarious to see American media fuck that up almost every time ... ;)



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread Alan McKinnon
On Friday 04 December 2009 00:07:33 Volker Armin Hemmann wrote:
 On Donnerstag 03 Dezember 2009, Renat Golubchyk wrote:
  Hi!
 
  On Thu, 3 Dec 2009 11:20:03 -0800
 
  fe...@crowfix.com wrote:
   In Germany is a district Busingen, with an umlauted 'u'.  Is it
   reasonable to consider it the same word whether with or without the
   umlauted u?
 
  No. For many words it would be OK, but not for all. For example,
  "drucken" means "to print", while "drücken" (with an umlaut) means "to
  press". In German you can replace an umlaut with the combination base
  letter + e, i.e. ü -> ue, ö -> oe, and likewise ß -> ss. There are words
  in which the combination oe does not stand for ö in that particular
  case. So it's not straightforward, especially with names. Those may
  have a rather odd spelling for historical reasons.
 
 and it is hilarious to see American media fuck that up almost every time
  ... ;)
 

What's even funnier is hearing news readers on the South African public 
broadcaster try to pronounce regular *English* words...

-- 
alan dot mckinnon at gmail dot com



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread Francisco Ares
On Thu, Dec 3, 2009 at 6:29 PM, Renat Golubchyk ragerm...@gmx.net wrote:

 On Thu, 3 Dec 2009 12:07:26 -0800
 fe...@crowfix.com wrote:
  So do people type in Busingen different ways depending on how they
  feel, do some people always leave off the umlaut, do some always use
  it?

 If you want to leave off the umlaut you have to be absolutely sure that
 there exists no other place with the spelling without umlaut. Otherwise
 you may have collisions sometimes.


What about a set of dictionaries? And also a library for mistyped word
search?
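
For the mistyped-word part, Python's standard difflib module may already
be enough.  A minimal sketch, where the place list, the function name
suggest and the 0.8 cutoff are all assumptions picked for illustration:

```python
import difflib

# Hypothetical dictionary of correctly spelled place names.
KNOWN_PLACES = ["Göttingen", "Münster", "Munster", "Soest", "Oelde"]

def suggest(user_input: str, known=KNOWN_PLACES, cutoff: float = 0.8):
    """Return up to three known names that closely resemble the input."""
    by_folded = {name.casefold(): name for name in known}
    hits = difflib.get_close_matches(
        user_input.casefold(), list(by_folded), n=3, cutoff=cutoff
    )
    # Best match first; map the folded spellings back to the stored names.
    return [by_folded[hit] for hit in hits]
```

Note that an input of "Munster" brings back both Munster and Münster,
which is exactly the ambiguity discussed above, so the result is a list
of candidates to confirm rather than an automatic correction.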

Francisco


Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread Arttu V.
On 12/3/09, fe...@crowfix.com fe...@crowfix.com wrote:
 I have a project which requires normalizing names, and by that, I mean
 converting to lower case etc, whatever eliminates redundancies.

I assume you have already removed the language problem from the
equation? I.e., the fact that København, Copenhague, Kööpenhamina and
Copenhagen all mean the same place, just in different European
languages (Danish, Spanish, Finnish and English, in that order).

If you have input in multiple languages then it is not just about
umlauts or no umlauts ...
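
Handling that usually comes down to an explicit alias table rather than
any string trick.  A small Python sketch; the table layout and function
names are made up, and using the Danish name as the canonical one is an
assumption:

```python
# Canonical name -> spellings in other languages (from the list above).
ALIASES = {
    "København": ["Copenhague", "Kööpenhamina", "Copenhagen"],
}

def build_alias_index(aliases):
    """Map every known spelling, case-folded, to its canonical name."""
    index = {}
    for canonical, variants in aliases.items():
        for spelling in [canonical, *variants]:
            index[spelling.casefold()] = canonical
    return index

ALIAS_INDEX = build_alias_index(ALIASES)

def canonical_place(user_input: str):
    """Return the canonical name for any listed spelling, else None."""
    return ALIAS_INDEX.get(user_input.strip().casefold())
```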

-- 
Arttu V.



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread felix
On Thu, Dec 03, 2009 at 08:32:45PM -0200, Francisco Ares wrote:

 What about a set of dictionaries? And also a library for mistyped word
 search?

Way too much effort for this.  Nice idea, might even be fun, but it's
just trying to avoid the common things, and I mainly wondered about
how often people whose keyboards have accents etc skip them if they
have the chance and what the repercussions would be.

We have people already who enter Brooklyn for their city instead of
New York, which might even be technically correct, but it doesn't
match anything else and we don't correct it because it doesn't happen
often enough.  There may be people in Louisiana who use the original
French spelling for all I know; we don't handle that either.  Or ditto
for Spanish names in the southwest that might have a tilde.




Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread felix
On Fri, Dec 04, 2009 at 12:38:34AM +0200, Arttu V. wrote:

 I assume you have already removed the language problem from the
 equation? I.e., the fact that København, Copenhague, Kööpenhamina and
 Copenhagen all mean the same place, just in different European
 languages (Danish, Spanish, Finnish and English, in that order).

We're trying to go by the name in the native language, but that might
not be possible, in which case I guess we'll have to get all the
possible translations.  It certainly is messy.

 If you have input in multiple languages then it is not just about
 umlauts or no umlauts ...

I've certainly learned a bit of that ... so far we only have to deal
with country codes (no problem) and districts (can't be too many, I
hope).




Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread Volker Armin Hemmann
look at my name, ok?

Just dropping the Umlaut is wrong. No if, but, maybe. It is wrong. Error. 
Mistake. Fail. If you can not enter ä, ö or ü, you must transform them to ae, 
oe or ue.
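
That transformation is mechanical in one direction.  A minimal Python
sketch (the function name is made up, and the ß -> ss entry is added
from the rule given earlier in the thread):

```python
import unicodedata

# German-specific replacements; this is not a general accent stripper.
GERMAN_ASCII = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def transliterate_german(text: str) -> str:
    """Replace umlauts and ß with their ASCII digraphs."""
    # Compose first so 'u' + combining diaeresis is seen as a single 'ü'.
    return unicodedata.normalize("NFC", text).translate(GERMAN_ASCII)
```

One known rough edge: an all-caps input such as MÜNSTER comes out as
MUeNSTER, so case folding should happen before this step if it matters.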



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread Alan McKinnon
On Friday 04 December 2009 02:03:23 Volker Armin Hemmann wrote:
 look at my name, ok?
 
 Just dropping the Umlaut is wrong. No if, but, maybe. It is wrong. Error.
 Mistake. Fail. If you can not enter ä, ö or ü, you must transform them to
  ae, oe or ue.
 


Your name shows here in 7-bit ASCII:

 
Volker Armin Hemmann volkerar...@googlemail.com

typo?


-- 
alan dot mckinnon at gmail dot com



Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

2009-12-03 Thread felix
On Fri, Dec 04, 2009 at 01:03:23AM +0100, Volker Armin Hemmann wrote:
 look at my name, ok?
 
 Just dropping the Umlaut is wrong. No if, but, maybe. It is wrong. Error. 
 Mistake. Fail. If you can not enter ä, ö or ü, you must transform them to ae, 
 oe or ue.

I'd like to find a program which would do that!  Seriously.  But
anyway, the purpose of this is not to transform names so our antique
ASCII-7 computers can store them, but to eliminate redundant records.
For instance, we get data from vendors for all cities and states,
geolocation data, which has its own redundancies, such as both FORT
WORTH and FT WORTH, or SAINT LOUIS and ST LOUIS.  But we have to
convert to upper case, get rid of punctuation, get rid of extra white
space, etc, and all that is independent of the locale.  I want to do
the same for Unicode.  If enough Europeans are in the habit of taking
shortcuts and skipping umlauts and accents and cedilla and tildes,
then I'd like to standardize the data for lookup.  This has nothing to
do with converting people's names for storage.  We don't even store
the transformed place name.
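
For reference, the locale-independent steps described above might look
roughly like this in Python.  It is a sketch, not our actual code: the
function name and the abbreviation table are assumptions, with only the
FT and ST entries taken from the examples in this message.

```python
import re

# Hypothetical abbreviation table; real vendor data would need more entries.
ABBREVIATIONS = {"FT": "FORT", "ST": "SAINT"}

def standardize(name: str) -> str:
    """Build a lookup-only key: upper case, no punctuation, single spaces."""
    upper = name.upper()
    # Replace anything that is neither a word character nor whitespace.
    no_punct = re.sub(r"[^\w\s]", " ", upper)
    # split() with no argument also collapses runs of whitespace.
    words = no_punct.split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)
```

The result is used only as a comparison key, so the original spelling
stays untouched in storage.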
