Re: LC_CTYPE=UTF-8

2020-06-25 Thread Ingo Schwarze
Hi Alan,

Alan Coopersmith wrote on Thu, Jun 25, 2020 at 12:13:33PM -0700:
> On 6/25/20 8:31 AM, Ingo Schwarze wrote:

>> Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8
>> as synonyms looks a bit like asking for the best colour of a bikeshed.
>> Given that the standard already contains the redundancy of requiring
>> both "C" and "POSIX", maybe it is more consistent to also require
>> both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters
>> greatly.

> The only thought I had along those lines was that I thought the "C"
> locale came from the C standard, and might be best left to the C
> committee to standardize, while this group controls the "POSIX"
> locale definition.  I suspect those following the POSIX standards
> would end up implementing both, regardless of which specification
> defines each.

That sounds quite reasonable to me.

Yours,
  Ingo



Re: LC_CTYPE=UTF-8

2020-06-25 Thread Ingo Schwarze
Hi Alan,

Alan Coopersmith wrote on Thu, Jun 25, 2020 at 07:59:39AM -0700:
> On 6/25/20 6:33 AM, Hans Aberg wrote:

>> Perhaps there should be a default UTF-8 locale: It seems that the
>> current construct does not apply so well to it.

> If the goal is to standardize existing behavior the standard could define
> the C.UTF-8 locale (or perhaps a POSIX.UTF-8 locale) that a number of
> systems already have, which is the standard C/POSIX locale with just the
> character set changed to UTF-8 instead.

This idea makes a lot of sense to me.

If the Austin Group decides that it wants to go in that direction,
i would make sure that both OpenBSD and the software i publish use
that name for a locale with these properties and consistently
recommend using that name.  Both already support a locale with these
properties and select it if the user asks for C.UTF-8 or POSIX.UTF-8,
but so far, they recommend that users specify en_US.UTF-8 (for
historical reasons), which is a bit unfortunate because it looks
like requesting cultural conventions for a particular country, which
is not the intention.

Whether to standardize only C.UTF-8 or both C.UTF-8 and POSIX.UTF-8
as synonyms looks a bit like asking for the best colour of a bikeshed.
Given that the standard already contains the redundancy of requiring
both "C" and "POSIX", maybe it is more consistent to also require
both "C.UTF-8" and "POSIX.UTF-8", but i don't think that matters
greatly.

Yours,
  Ingo



Re: LC_CTYPE=UTF-8

2020-06-25 Thread Ingo Schwarze
Hi Hans,

Hans Aberg wrote on Thu, Jun 25, 2020 at 10:15:03AM +0200:

> MacOS sets as default LC_CTYPE=UTF-8, not appearing in the 'locale
> -a' list. Then some software interprets this as though the locale
> is C/POSIX, disregards the UTF-8 encoding, and converts all non-ASCII
> (high bit set) char's into octal escape sequences. What is the
> correct interpretation here?

The correct interpretation of "LC_CTYPE=UTF-8" is whatever the
documentation of the respective operating system says.
All POSIX says is:

  https://pubs.opengroup.org/onlinepubs/9699919799/functions/setlocale.html

  The locale argument is a pointer to a character string containing
  the required setting of category.  The contents of this string are
  implementation-defined.

POSIX only specifies the meaning of the strings "C" and "POSIX";
any others are implementation-defined.
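
A minimal C sketch of how a program can probe what a given locale
string means on the system it runs on.  The helper name codeset_for()
is hypothetical; setlocale() and nl_langinfo(CODESET) are the standard
POSIX interfaces.  Only the names "C" and "POSIX" are guaranteed to be
accepted; the result for any other string depends on the implementation.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper: try to select locale_name for LC_CTYPE and
 * report the resulting character encoding.
 * Returns NULL if setlocale() rejects the name on this system.
 * Side effect: changes the LC_CTYPE locale of the calling process.
 */
const char *
codeset_for(const char *locale_name)
{
	if (setlocale(LC_CTYPE, locale_name) == NULL)
		return NULL;	/* name not accepted here */

	/* The returned string may be overwritten by later calls. */
	return nl_langinfo(CODESET);
}
```

Calling codeset_for("UTF-8") or codeset_for("C.UTF-8") is then a quick
way to see what a particular system makes of those strings; both the
return value and whether the call succeeds at all may differ between
systems.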

For example, the OpenBSD manual page says:

  https://man.openbsd.org/setlocale.3

  The syntax and semantics of the locale argument are not standardized
  and vary among operating systems.  On OpenBSD, if the locale string
  ends with ".UTF-8", the UTF-8 locale is selected; otherwise, the
  "C" locale is selected, which uses the ASCII character set.  If
  the locale contains a dot but does not end with ".UTF-8", setlocale()
  fails.

Which is indeed true here:

   $ uname -a
  OpenBSD isnote.usta.de 6.7 GENERIC.MP#224 amd64
   $ LC_CTYPE=FOOBAR.UTF-8 locale charmap
  UTF-8
   $ LC_CTYPE=UTF-8 locale charmap  
  US-ASCII

To the best of my knowledge, we are POSIX-compliant in this respect.
Other systems are of course free to make different choices.
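
The rule quoted from the OpenBSD manual page can be sketched in C
roughly as follows.  This is an illustration of the documented rule
only, not the actual OpenBSD source; the enum and the function name
classify_locale_name() are made up for this example, and the special
cases of setlocale() (a NULL pointer or an empty string) are
deliberately ignored.

```c
#include <string.h>

/* Hypothetical result type for this sketch. */
enum ctype_choice {
	CTYPE_FAIL,	/* setlocale() would fail */
	CTYPE_C,	/* "C" locale, ASCII character set */
	CTYPE_UTF8	/* UTF-8 locale */
};

/*
 * Apply the documented rule to a locale string:
 *  - ends with ".UTF-8": UTF-8 locale
 *  - contains a dot but does not end with ".UTF-8": failure
 *  - anything else: "C" locale
 */
enum ctype_choice
classify_locale_name(const char *name)
{
	static const char suffix[] = ".UTF-8";
	size_t len = strlen(name);
	size_t slen = sizeof(suffix) - 1;

	if (len >= slen && strcmp(name + len - slen, suffix) == 0)
		return CTYPE_UTF8;
	if (strchr(name, '.') != NULL)
		return CTYPE_FAIL;
	return CTYPE_C;
}
```

Under this rule, "FOOBAR.UTF-8" selects UTF-8 while plain "UTF-8",
which contains no dot, falls through to the "C" locale, matching the
two locale charmap results shown above.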

Even though POSIX says this is implementation-defined, which implies
that operating systems are expected to document their specific rules,
some fail to do so, for example:

  https://man.bsd.lv/FreeBSD-12.0/setlocale.3
  https://man.bsd.lv/NetBSD-8.1/setlocale.3

Some do specify it.  For example, according to

  https://man.bsd.lv/Linux-5.06/setlocale.3

the string "UTF-8" would be invalid because it lacks the "language"
part which is mandatory on Linux.

For example, on a very old Linux system i have access to:

   $ uname -a
  Linux donnerwolke.asta.kit.edu 4.9.0-0.bpo.3-686 #1 SMP \
Debian 4.9.30-2+deb9u5~bpo8+1 (2017-09-28) i686 GNU/Linux
   $ LC_CTYPE=en_US.UTF-8 locale charmap
  UTF-8
   $ LC_CTYPE=UTF-8 locale charmap
  locale: Cannot set LC_CTYPE to default locale: No such file or directory
  locale: Cannot set LC_ALL to default locale: No such file or directory
  ANSI_X3.4-1968

Yours,
  Ingo



Re: behaviour upon non-matching globs (Was: Arrays)

2019-06-02 Thread Ingo Schwarze
Hi Steven,

Steven Penny wrote on Sun, Jun 02, 2019 at 10:48:53AM -0500:
> On Sun, Jun 2, 2019 at 10:39 AM Chet Ramey wrote:

>> You might want to reconsider this proposal, given the pervasive use of
>> tools like grep as filters and as components of pipelines.

> Its misleading that you omitted the next paragraph.

I admire Chet's polite understatement; it made my day when i first
saw it.

I would have expressed my reaction more bluntly:  You are welcome
to design your own operating system elsewhere, but what you propose
is off-topic on this list, which is about UNIX.

Please just stop talking about totally breaking basic tools like grep
and ls.  And no, adding knobs is not an excuse for that.

Back to lurking,
  Ingo



Re: In defence of compound aliases

2019-01-14 Thread Ingo Schwarze
Hi Martijn,

Martijn Dekker wrote on Mon, Jan 14, 2019 at 06:59:14AM +0100:

> Indeed, it's not as if aliases are some sort of strange anomaly in the
> programming world. In other languages, similar features are usually
> called 'macros'. C library sources are loaded with them.

It depends.  Some software is indeed riddled with macros.
Other software does actively avoid them where possible.

> Nobody calls that a bad idea, or tells library programmers that they
> should use functions instead --

We do exactly that in OpenBSD, and we say exactly that, even with
a special emphasis, and we have been saying so for a long time.
Weeding out as much usage of macros as possible has been among the
most important refactoring techniques employed in LibreSSL, but not
only there; it is a general consensus diligently applied throughout
the system.

Using a C preprocessor macro containing unbalanced braces or
parentheses almost guarantees that your patch gets rejected outright
and instantly.  Even using normal function-style macros is among
the most frequent reasons for patches being rejected or for tweaks
being requested before commit.  I actively avoid macros even for integer
constants where possible and try to use enums instead.  About the
only use of macros that is uncontroversial is for named bits in
unsigned integer variables intended for use with "|" and "&"
operators, i.e. to store groups of boolean flags.
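
A small sketch of the two conventions just described: an enum instead
of macros for plain integer constants, and macros only for named bits
that are combined with "|" and tested with "&".  All identifiers
(MAX_DEPTH, OPT_VERBOSE, set_flag() and so on) are invented for
illustration and not taken from any OpenBSD source file.

```c
/*
 * Integer constants as an enum rather than as macros: the names are
 * real identifiers known to the compiler and the debugger, not
 * textual substitutions.
 */
enum parse_limit {
	MAX_DEPTH = 32,
	MAX_TOKENS = 1024
};

/* Named bits for a group of boolean flags. */
#define OPT_VERBOSE	(1U << 0)
#define OPT_RECURSE	(1U << 1)
#define OPT_FORCE	(1U << 2)

/* Return flags with the given bit switched on. */
unsigned int
set_flag(unsigned int flags, unsigned int bit)
{
	return flags | bit;
}

/* Return 1 if the given bit is set in flags, 0 otherwise. */
int
has_flag(unsigned int flags, unsigned int bit)
{
	return (flags & bit) != 0;
}
```

For example, has_flag(set_flag(0, OPT_RECURSE), OPT_RECURSE) is true,
while testing the same value for OPT_FORCE is false.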

In a nutshell, you more easily get away with using "goto" than with
using macros, and you certainly get away more easily with using "goto"
in creative ways than with using macros in creative ways.

> because macros/aliases and functions are good at different things.

Macros are best at making code unreadable, prone to bugs, hostile
to debugging, and at giving away many benefits of compile-time
checks and type safety.
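
One well-known instance of the bug-proneness, sketched here with
made-up names (SQUARE_MACRO, square_func, counted): a function-style
macro pastes its argument text into every place the parameter is used,
so an argument with side effects is evaluated more than once, whereas
a real function evaluates it exactly once and checks its type.

```c
/* Function-style macro: the argument text appears twice. */
#define SQUARE_MACRO(x)	((x) * (x))

/* Equivalent function: the argument is evaluated once, as an int. */
int
square_func(int x)
{
	return x * x;
}

/* Counts how often counted() has been called. */
int calls;

/* Return v unchanged, recording that an evaluation took place. */
int
counted(int v)
{
	calls++;
	return v;
}
```

With these definitions, SQUARE_MACRO(counted(3)) and
square_func(counted(3)) both yield 9, but the macro version calls
counted() twice and the function version calls it once.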

> Creative use of aliases may be a Bad Idea for casual shell scripters,

We definitely consider creative use of macros a Bad Idea even for
the most capable C programmers.

Also note that we generally advise against using the shell for any
kind of serious programming since there are few languages that make
safe programming practices as hard as the shell does, even when used
in very conventional ways without any special creativity.

Note that i'm intentionally not commenting on the standardization
of shell aliases.  I don't really care how they are standardized,
and i don't think i would ever use them in any non-trivial shell
program.  Even in interactive shell use, i only use a very small
number of aliases in the most trivial ways - and avoid them for
mostly the same reasons that i frown upon C macros.

I merely thought it might be useful to point out that the statement
of "Nobody calls that a bad idea" grossly mismatches reality.

Now, back to lurking...

Yours,
  Ingo