Re: Seeing clarification for locale names

2021-03-16 Thread Marc Haber
Hi,

I apologize for the late answer. Please keep me in Cc:, I am not
subscribed.

On Mon, Feb 15, 2021 at 05:20:30PM +0100, Florian Weimer wrote:
> * Marc Haber:
> > I would appreciate pointers to documentation, personal opinions, war
> > stories, encoding tales, historic lectures, anything that might
> > enlighten me and help me build the knowledge and understanding about
> > how UNIX locales are supposed to work in Debian GNU/Linux. Thank you in
> > advance!
> 
> For the charset normalization, it's in the manual:



> This code dates back to the mid-90s, I think.

Took me 20+ years to finally notice.

> In general, I think it is best to treat locale names as opaque strings.

What is the recommended setting for the LANG and LC_ variables?
de_DE.UTF-8 or the normalized version?

> Parsing them to derive charsets is not going to work (e.g., no charset
> can mean ISO-8859-1 or UTF-8, depending on the age of the locale).  To
> get the charset of the current locale, you can use “locale -k charmap”,
> for example.  It corresponds to the glibc charmap name (of which there
> aren't too many).

So the recommended way is to just set LANG to the wanted value and then
check whether locale -k charmap returns the expected value? And
'charmap="ANSI_X3.4-1968"' is a telltale sign that I set LANG to a value
that isn't generated on the local system?
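
A way to probe this without parsing any command output is to ask the C
library whether it can activate the name at all. The following is only a
minimal sketch in Python, assuming the standard locale module (a wrapper
around setlocale(3)); the function name locale_is_available is made up
for illustration and is not part of any tool mentioned in this thread.

```python
import locale


def locale_is_available(name: str) -> bool:
    """Return True if the C library can activate `name` for LC_CTYPE."""
    # Calling setlocale() without a new value only queries the current one.
    previous = locale.setlocale(locale.LC_CTYPE)
    try:
        locale.setlocale(locale.LC_CTYPE, name)
        return True
    except locale.Error:
        # Raised when the requested locale cannot be set.
        return False
    finally:
        # Restore whatever was active before the probe.
        locale.setlocale(locale.LC_CTYPE, previous)


if __name__ == "__main__":
    for candidate in ("de_DE.UTF-8", "de_DE.utf8", "C"):
        print(candidate, locale_is_available(candidate))
```

The probe restores the previous LC_CTYPE setting, so it should be safe to
call before deciding which value to export as LANG.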

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mail address in header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421



Re: Seeing clarification for locale names

2021-02-15 Thread Florian Weimer
* Marc Haber:

> I would appreciate pointers to documentation, personal opinions, war
> stories, encoding tales, historic lectures, anything that might
> enlighten me and help me build the knowledge and understanding about
> how UNIX locales are supposed to work in Debian GNU/Linux. Thank you in
> advance!

For the charset normalization, it's in the manual:

The only new thing is the @code{normalized codeset} entry.  This is
another goodie which is introduced to help reduce the chaos which
derives from the inability of people to standardize the names of
character sets.  Instead of @w{ISO-8859-1} one can often see @w{8859-1},
@w{88591}, @w{iso8859-1}, or @w{iso_8859-1}.  The @code{normalized
codeset} value is generated from the user-provided character set name by
applying the following rules:

@enumerate
@item
Remove all characters besides numbers and letters.
@item
Fold letters to lowercase.
@item
If the same only contains digits prepend the string @code{"iso"}.
@end enumerate

@noindent
So all of the above names will be normalized to @code{iso88591}.  This
allows the program user much more freedom in choosing the locale name.


This code dates back to the mid-90s, I think.

In general, I think it is best to treat locale names as opaque strings.
Parsing them to derive charsets is not going to work (e.g., no charset
can mean ISO-8859-1 or UTF-8, depending on the age of the locale).  To
get the charset of the current locale, you can use “locale -k charmap”,
for example.  It corresponds to the glibc charmap name (of which there
aren't too many).
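
As an illustration of reading the charmap instead of parsing the locale
name, here is a small Python sketch that runs "locale -k charmap" and
extracts the quoted value. The helper names parse_charmap and
current_charmap are hypothetical, and the sketch assumes the output has
the form charmap="UTF-8".

```python
import subprocess


def parse_charmap(output: str) -> str:
    """Extract the value from a line such as: charmap="UTF-8"."""
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "charmap":
            # Drop surrounding whitespace and the double quotes.
            return value.strip().strip('"')
    raise ValueError("no charmap entry in output")


def current_charmap() -> str:
    """Run `locale -k charmap` under the inherited environment."""
    result = subprocess.run(
        ["locale", "-k", "charmap"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_charmap(result.stdout)


if __name__ == "__main__":
    print(current_charmap())
```

The parsing is kept separate from the subprocess call so that it can be
checked against sample strings without touching the system configuration.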

Thanks,
Florian
-- 
Red Hat GmbH, https://de.redhat.com/ , Registered seat: Grasbrunn,
Commercial register: Amtsgericht Muenchen, HRB 153243,
Managing Directors: Charles Cachera, Brian Klemm, Laurie Krebs, Michael O'Neill



Seeing clarification for locale names

2021-02-14 Thread Marc Haber
[Please Cc: me on replies, I am not subscribed to Debian-glibc]

Hi,

I am a bit confused about locale names. In literature, one can see that
a proper locale name is, for example, en_US.UTF-8. This is also what I
write in /etc/locale.gen to have one locale "generated" on my system.

locale -a, however, will print en_US.utf8. I _think_ this is the
intended behavior since there is a normalizing function somewhere in the
glibc sources which lowercases everything and throws out all
punctuation.
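
The glibc manual describes this normalization as three rules: remove
everything except letters and digits, fold letters to lowercase, and
prepend "iso" if only digits remain. A minimal Python sketch of those
rules follows; it is not the glibc implementation, and the function name
normalize_codeset is chosen here purely for illustration.

```python
def normalize_codeset(codeset: str) -> str:
    """Apply the three documented rules to a character set name."""
    # Rule 1: keep only ASCII letters and digits.
    kept = [ch for ch in codeset if ch.isascii() and ch.isalnum()]
    # Rule 2: fold letters to lowercase.
    normalized = "".join(kept).lower()
    # Rule 3: a name consisting only of digits gets an "iso" prefix.
    if normalized.isdigit():
        normalized = "iso" + normalized
    return normalized


if __name__ == "__main__":
    names = ("ISO-8859-1", "8859-1", "88591", "iso8859-1", "iso_8859-1", "UTF-8")
    for name in names:
        print(f"{name!r} -> {normalize_codeset(name)!r}")
```

Under these rules UTF-8 becomes utf8, which matches the suffix that
locale -a prints.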

OTOH, there are applications that will malfunction or print a warning if
the locale isn't explicitly set to .UTF-8 (upper case, hyphen).

In my shell profile scripts, I have code that checks whether the
intended locale is actually present on the local system by comparing it
to locale -a's output (to avoid falling back to a non-UTF-8 locale that
does not know about German umlauts when a UTF-8 one is available). Hence,
my locale environment variables are all set with the .utf8 suffix since
that's what locale -a will print. Is this a wrong approach?
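
Such a comparison can be made independent of the spelling of the suffix
by normalizing the codeset part on both sides before comparing. The
Python sketch below illustrates the idea and is not the actual profile
code; normalized_locale and locale_listed are hypothetical names, and the
sample locale -a output is invented for the example.

```python
import re


def normalized_locale(name: str) -> str:
    """Normalize only the codeset part, e.g. de_DE.UTF-8 -> de_DE.utf8."""
    base, dot, rest = name.partition(".")
    if not dot:
        return name  # no codeset given, nothing to normalize
    codeset, at, modifier = rest.partition("@")
    codeset = re.sub(r"[^A-Za-z0-9]", "", codeset).lower()
    if codeset.isdigit():
        codeset = "iso" + codeset
    return f"{base}.{codeset}{at}{modifier}"


def locale_listed(wanted: str, locale_a_output: str) -> bool:
    """Check `wanted` against the lines printed by `locale -a`."""
    listed = {
        normalized_locale(line.strip())
        for line in locale_a_output.splitlines()
        if line.strip()
    }
    return normalized_locale(wanted) in listed


if __name__ == "__main__":
    # Invented sample output, for illustration only.
    sample = "C\nC.UTF-8\nPOSIX\nde_DE.utf8\nen_US.utf8\n"
    print(locale_listed("de_DE.UTF-8", sample))
    print(locale_listed("fr_FR.UTF-8", sample))
```

With this, LANG can keep the conventional .UTF-8 spelling while the
check still matches the .utf8 form in the list.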

I would appreciate pointers to documentation, personal opinions, war
stories, encoding tales, historic lectures, anything that might
enlighten me and help me build the knowledge and understanding about
how UNIX locales are supposed to work in Debian GNU/Linux. Thank you in
advance!

Greetings
Ma 'Schei? Encoding!' rc
