Re: [gentoo-user] [OT] Need advice from people who use non-ascii all day long

J. Roeleveld Tue, 15 Dec 2009 10:00:58 -0800

On Thursday 03 December 2009 20:20:03 [email protected] wrote:
> I have a project which requires normalizing names, and by that, I mean
> converting to lower case etc, whatever eliminates redundancies.  I
> know Unicode has a different "normalize" meaning, but for my purposes,
> that has already been done.  Maybe I should call it standardization or
> make up a new cromulent word.
> 
> By which I really mean I am confused by a lot of advice I have gotten
> from USAians who get by with the good old 7 bit ASCII character set on
> a daily basis, whether it be written in Unicode or not.
> 
> One of the puzzles to me is all the accented chars.  Umlauts, etc.  I
> am not trying to convert names for permanent purposes but for internal
> comparison.  In Germany is a district "Busingen", with an umlauted
> 'u'.  Is it reasonable to consider it the same word whether with or
> without the umlauted u?  French has the cedilla and acute and grave
> accents.  Spanish has the tilde n.  Scandinavian languages (all?
> some?) have the o with a slash.
> 
> Or put another way, I don't know much about German, French, Spanish,
> etc keyboards.  Do your keyboards have any of the extra keys, all of
> them?  Are German keyboards and French and Spanish keyboards as
> restricted to their own languages as US keyboards are?  If you have to
> hit two or three keys to keep the umlauts, accents, and tildes, do you
> get lazy sometimes and type the base character by itself?  Is it even
> considered the base character, or is it considered lazy and sloppy,
> much as I get complaints about typing "thru" because "through" is too
> much trouble?
> 
> I need something the equivalent of the C function strcasecmp() which
> not only ignores case, but all other differences without distinction,
> whatever they may be.  If leaving off umlauts horrifies academics and
> purists but is what people do in the real world, I want to take that
> into consideration, so that if one person uses the umlaut and another
> doesn't, it won't generate two separate entries.  But if leaving off
> the umlaut or accent is a distinct place name, then I can't do that --
> but if real world people do that and live with the confusion, then I
> guess I have to make a different choice.
> 
> Yes, I am something of an ignorant American.  I know some Japanese,
> French, and Spanish, but not the details of everyday usage.  I'd like
> to learn.
> 
Hi Felix,


Apart from what was already mentioned, you might want to also consider the 
following:

1) Even though people try to get it right, non-natives can still make
mistakes with the names. These mistakes are frowned upon by the natives,
but are a part of life.

2) Names of cities can change over time, for example:
"New York" used to be called "Nieuw Amsterdam" (or "New Amsterdam" in English)

3) Some cities have multiple valid spellings in the same language:
The Hague = "Den Haag" or "'s-Gravenhage" (yes, the apostrophe before the "s"
is part of the second version of the name)
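
To make points 2 and 3 a bit more concrete, here is a rough sketch in Python
(my choice of language, nothing in this thread requires it). The names
fold_name, ALIASES and canonical_key are made up for illustration, and the
alias table only holds the examples mentioned above. It folds case and
accents for comparison, then looks the result up in a hand-maintained table
of alternative names:

```python
import unicodedata

def fold_name(name: str) -> str:
    """Reduce a place name to a comparison key (hypothetical helper)."""
    # NFKD splits an accented letter into base letter + combining mark,
    # e.g. "ü" becomes "u" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop the combining marks, keeping only the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower(), e.g. "ß" becomes "ss".
    folded = stripped.casefold()
    # Treat punctuation (apostrophes, hyphens) as whitespace and collapse runs.
    cleaned = "".join(ch if ch.isalnum() else " " for ch in folded)
    return " ".join(cleaned.split())

# Hand-maintained table: folded variant -> preferred name.
# Only the examples from this mail; a real table would need far more entries.
ALIASES = {
    fold_name("'s-Gravenhage"): "Den Haag",
    fold_name("The Hague"): "Den Haag",
    fold_name("Nieuw Amsterdam"): "New York",
    fold_name("New Amsterdam"): "New York",
}

def canonical_key(name: str) -> str:
    """Return the key to compare on, after resolving known aliases."""
    key = fold_name(name)
    preferred = ALIASES.get(key)
    return fold_name(preferred) if preferred else key
```

With this, canonical_key("'s Gravenhage") and canonical_key("den haag") give
the same key. Two caveats: letters such as "ø" have no decomposition, so they
survive the folding unchanged, and the German habit of writing "ue" for "ü"
is not handled at all. Both would need explicit rules.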

An easier option might be to match on postcodes. These should be unique
within a country, so if you put the country's international abbreviation in
front, like so: "NL-1234 AA" or "D-12345", you have a single field to check,
and you can then link the actual city name to the postcode area.

Disclaimer: I have no clue if these 2 postcodes actually exist
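
A minimal sketch of that postcode idea, again in Python and again with
made-up names (postcode_key, city_by_postcode, register). It assumes you
already have the country abbreviation and the postcode as separate strings,
and it does no validation of the postcode format:

```python
def postcode_key(country: str, postcode: str) -> str:
    """Build a single comparison field such as "NL-1234 AA"."""
    # Uppercase both parts and collapse stray whitespace in the postcode.
    country = country.strip().upper()
    code = " ".join(postcode.upper().split())
    return f"{country}-{code}"

# Hypothetical lookup table: postcode key -> the city name first seen for it.
city_by_postcode = {}

def register(country: str, postcode: str, city: str) -> str:
    """Record a city under its postcode key and return the name on file."""
    key = postcode_key(country, postcode)
    # setdefault keeps the first name, so later spellings map to the same entry.
    return city_by_postcode.setdefault(key, city)
```

Registering ("NL", "1234 AA", "Den Haag") and then
("nl", "1234 aa", "'s-Gravenhage") ends up on the same entry, so the
spelling of the city name no longer matters for the comparison.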

--
Joost
