Re: Seeing clarification for locale names

Florian Weimer Mon, 15 Feb 2021 08:51:24 -0800

* Marc Haber:

> I would appreciate pointers to documentation, personal opinions, war
> stories, encoding tales, historic lectures, anything that might
> enlighten me and help me build the knowledge and understanding about
> how UNIX locales are supposed to work in Debian GNU/Linux. Thank you in
> advance!


For the charset normalization, it's in the manual:

The only new thing is the @code{normalized codeset} entry.  This is
another goodie which is introduced to help reduce the chaos which
derives from the inability of people to standardize the names of
character sets.  Instead of @w{ISO-8859-1} one can often see @w{8859-1},
@w{88591}, @w{iso8859-1}, or @w{iso_8859-1}.  The @code{normalized
codeset} value is generated from the user-provided character set name by
applying the following rules:

@enumerate
@item
Remove all characters besides numbers and letters.
@item
Fold letters to lowercase.
@item
If the same only contains digits prepend the string @code{"iso"}.
@end enumerate

@noindent
So all of the above names will be normalized to @code{iso88591}.  This
allows the program user much more freedom in choosing the locale name.
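
As a rough illustration, the three rules quoted above can be sketched in
Python. This is not the glibc implementation; the function name
normalize_codeset is made up for this example, and it assumes that
"numbers and letters" means ASCII alphanumerics.

```python
def normalize_codeset(name: str) -> str:
    """Sketch of the codeset normalization rules quoted from the manual.

    Hypothetical helper, not glibc code. Assumes only ASCII letters
    and digits count as "numbers and letters".
    """
    # Rule 1: remove all characters besides numbers and letters.
    kept = "".join(c for c in name if c.isascii() and c.isalnum())
    # Rule 2: fold letters to lowercase.
    normalized = kept.lower()
    # Rule 3: if only digits remain, prepend the string "iso".
    if normalized.isdigit():
        normalized = "iso" + normalized
    return normalized


if __name__ == "__main__":
    for spelling in ("ISO-8859-1", "8859-1", "88591", "iso8859-1", "iso_8859-1"):
        print(spelling, "->", normalize_codeset(spelling))
```

With these rules, every spelling listed in the manual excerpt comes out
as iso88591.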


This code dates back to the mid-90s, I think.

In general, I think it is best to treat locale names as opaque strings.
Parsing them to derive charsets is not going to work (e.g., a name
without a charset can mean ISO-8859-1 or UTF-8, depending on the age of
the locale).  To
get the charset of the current locale, you can use “locale -k charmap”,
for example.  It corresponds to the glibc charmap name (of which there
aren't too many).
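
From a program, a similar query is possible without parsing the locale
name. The following is a minimal Python sketch, assuming a Unix-like
system where locale.nl_langinfo is available; the function name
current_charmap is made up for this example. On glibc I would expect
the result to match the charmap name, but treat that as an assumption.

```python
import locale


def current_charmap() -> str:
    """Return the codeset of the locale selected by the environment.

    Hypothetical helper for illustration. Asks the C library instead
    of parsing LANG or LC_ALL.
    """
    try:
        # Adopt the LC_CTYPE settings from the environment.
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed;
        # keep whatever locale is currently active.
        pass
    return locale.nl_langinfo(locale.CODESET)


if __name__ == "__main__":
    print(current_charmap())
```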

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill
