[Issue 8 drafts 0001561]: clarify what kind of data shell variables need to be able to hold

2022-02-08 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


A NOTE has been added to this issue. 
== 
https://www.austingroupbugs.net/view.php?id=1561 
== 
Reported By:                calestyo
Assigned To:
==
Project:                    Issue 8 drafts
Issue ID:                   1561
Category:                   Shell and Utilities
Type:                       Enhancement Request
Severity:                   Editorial
Priority:                   normal
Status:                     New
Name:                       Christoph Anton Mitterer
Organization:
User Reference:
Section:                    various
Page Number:                N/A
Line Number:                N/A
Final Accepted Text:
==
Date Submitted:             2022-02-01 00:10 UTC
Last Modified:              2022-02-09 01:58 UTC
==
Summary:                    clarify what kind of data shell variables need to be able to hold
== 

-- 
 (0005669) kre (reporter) - 2022-02-09 01:58
 https://www.austingroupbugs.net/view.php?id=1561#c5669 
-- 
Re https://www.austingroupbugs.net/view.php?id=1561#c5668:

> but one could implicitly deduce

No, that's not how it works.  If the standard requires something to be
done, it will say so.   If it requires something not to be done, it will
say so.   If it says nothing on an issue, then nothing is required.

If this causes an interoperability problem, then the standard is broken,
and should be fixed.

Here, nothing needs to be done (or perhaps nothing beyond what has already
been done), as it is already unspecified what happens in this case (which
means applications cannot depend upon either behaviour):

From 2.9.1.6 in draft 2.1:

   It is unspecified whether environment variables that were passed
   to the shell when it was invoked, but were not used to initialize 
   shell variables (see Section 2.5.3) because they had invalid names,
   are included in the environment passed to execl( ) and (if execl( )
   fails as described above) to the new shell.

Don't get all uptight that it says "environment variables" - as far as the
standard is concerned, everything in the environment is an environment
variable, as that's all the standard defines to go there.   Anything with
a valid name (which must be terminated by '=' to be valid) gets turned into
a shell variable, and exported.   Everything else has an invalid name, and
can be ignored by the shell (or used for any other purpose the shell
desires).
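
A rough sketch of the distinction drawn above, for anyone who wants to
poke at their own shell.  The helper names is_valid_name and
probe_invalid_name are made up for illustration, and the probe assumes
that the local env utility accepts an entry such as 'foo-bar=1' (not all
of them need to).  Whatever the probe reports is conforming, since the
behaviour is unspecified.

```shell
# Hypothetical helper: succeed if $1 is a valid Name (letters, digits and
# underscores from the portable character set, not starting with a digit).
# The bracket ranges assume the C/POSIX locale.
is_valid_name() {
    case $1 in
        '' | [0-9]* | *[!A-Za-z0-9_]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Hypothetical probe: hand sh an environment entry with an invalid name
# and count how often it shows up in the environment of a utility that
# this sh runs.  Prints 0 if the shell dropped it; either result is allowed.
probe_invalid_name() {
    env 'foo-bar=1' sh -c 'env' | grep -c '^foo-bar='
}

for n in HOME foo-bar 1abc _x9; do
    if is_valid_name "$n"; then
        echo "$n: valid name"
    else
        echo "$n: invalid name"
    fi
done
```

Calling probe_invalid_name on a few different shells is a quick way to see
that applications cannot rely on either outcome.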

Issue History 
Date Modified    Username       Field                    Change
==
2022-02-01 00:10 calestyo       New Issue
2022-02-01 00:10 calestyo       Name                      => Christoph Anton Mitterer
2022-02-01 00:10 calestyo       Section                   => various
2022-02-01 00:10 calestyo       Page Number               => N/A
2022-02-01 00:10 calestyo       Line Number               => N/A
2022-02-01 19:33 mirabilos      Note Added: 0005645
2022-02-01 19:44 calestyo       Note Added: 0005647
2022-02-01 20:52 chet_ramey     Note Added: 0005649
2022-02-01 23:07 kre            Note Added: 0005650
2022-02-02 15:15 chet_ramey     Note Added: 0005652
2022-02-02 16:39 calestyo       Note Added: 0005653
2022-02-02 18:44 kre            Note Added: 0005654
2022-02-06 11:18 mirabilos      Note Added: 0005662
2022-02-06 18:18 chet_ramey     Note Added: 0005665
2022-02-06 23:17 kre            Note Added: 0005666
2022-02-08 15:14 calestyo       Note Added: 0005668
2022-02-09 01:58 kre            Note Added: 0005669
==




Re: how do to cmd subst with trailing newlines portable (was: does POSIX mandate whether the output…)

2022-02-08 Thread Christoph Anton Mitterer via austin-group-l at The Open Group
Hey Eric.


On Tue, 2022-02-08 at 15:21 -0600, Eric Blake wrote:
> Yes. And another fallout of that requirement: you cannot have a
> single
> POSIX system supporting both ASCII and EBCDIC locales.

What does that mean in practise... does e.g. Linux/glibc ship these
locales just for the purpose of iconv and others... and apart from that
*any* glibc system will *always* be based on ASCII and *never* on
EBCDIC...
...or could one implementation (like glibc) support actually both, as
long as it doesn't switch from one to the other on one concrete host
and while that is running?


> > Doesn't that also mean that POSIX effectively forbids UTF16 or
> > UTF32
> > and actually any >1-byte fixed-encoding?
> > Cause there it would have to be "padded" with 0x00?
> 
> Correct - a POSIX environment cannot use UTF16 or UTF32 encodings as
> its basis.  Again, iconv and wide-character library calls (such as
> wprintf) can support conversion of files into and out of those
> encodings, but that is only file contents; all file names, syscalls,
> and other aspects of the POSIX environment for cross-process
> communication outside of file contents will use multi-byte encodings
> where no multi-byte sequence has an embedded 0x00 byte, and NOT wide
> character sequences that would represent UTF16 or UTF32 characters
> directly.

I had suspected that, especially also because of:
"The encoded values associated with the members of the portable
character set are each represented in a single byte."

But then stumbled over 3.251 Null Wide-Character Code:
"A wide-character code with all bits set to zero."

What's that then good for? Just for wchars, which *may* very well use
fixed size encodings (with multiple bytes) and in fact are 32 Bits
(UCS-4) in glibc?

But even then, for all syscalls etc... these wide chars would need to
get converted to/from "normal" multibyte chars, which use one byte for
the portable character set chars, and have the invariance of . / CR and
LF.

Does that sound right?


> POSIX states that <period> will be printed.  If that is the byte
> 0x2E, then your POSIX locale is probably ASCII-based.  But it is also
> possible to have a POSIX conforming environment where the POSIX
> locale is EBCDIC based, in which case it would print byte 0x4B, but
> that would still be <period> for all file names and syscalls
> observable from that POSIX environment.

I see... and again, within one implementation of these two, POSIX
wouldn't allow switching from one to the other.

So that means also, that if I have e.g. my shell script (say in UTF8)
which prints the sentinel via 'printf .', I'm always sure - on any
ASCII-based POSIX system, that regardless of the locale (which would
then need to be ASCII-based as well), '.' would give me 0x2E.

Whereas, when I'd use the same script *as is* on an EBCDIC system, it
would anyway not work out of the box and I'd have to iconv it first to
EBCDIC... and once done, my '.' in there, would always (regardless of
which locale - all of which would need to be EBCDIC-based, too) yield
0x4B?!

And effectively I could *never* run into the situation, that the script
itself is parsed with e.g. ASCII and '.' = 0x2E .. while the shell's
internal LC_ALL has changed to something where '.' would be something
else?!



> > b) print the character 'x' according to the currently set locale,
> > e.g.
> >    if that was using UTF16, it would print the bytes 0x2e 0x00
> 
> It is not possible to have a POSIX locale based on the UTF16
> encoding.
> So this answer is not possible.  While you can write a file with
> characters encoded in UTF16, which when recoded to a multibyte locale
> form a shell script, it is only after you use iconv or fscanf or
> similar to perform that encoding conversion before it actually
> becomes
> a shell script (since sh is documented as being able to reject files
> containing NUL bytes as not being a shell script).  POSIX does not
> allow you to execute a file encoded in UTF16 as a shell script.

Okay, clear now... at least for UTF16/32 ...

But say we have a multibyte based locale foo ... in which some
character X (symbolic name, not the literal X) has one encoding A'...
and another multibyte based locale bar in which X has another encoding
A''.

I seem to remember reading somewhere that the encoding in
which the shell parses the file (i.e. the one in which it was started)
would then be used.
So if the shell was started in A', even if it then switches to A'' and
its variables and so on would be interpreted according to A'', the
literals would continue to get A'.

But I cannot really find it in POSIX itself.


> 
> > d) Would it in some weird encodings like IBM905 cause the byte 0x4B
> > to
> >    be printed?
> 
> If you are running on an IBM machine where the POSIX locale is based
> on EBCDIC, then it will indeed print the byte 0x4B.  But it will
> still
> be <period>, as detected by all other processes reached from that
> POSIX environment (and that system will necessarily be unable to have
> an ASCII or UT

Re: how do to cmd subst with trailing newlines portable (was: does POSIX mandate whether the output…)

2022-02-08 Thread Eric Blake via austin-group-l at The Open Group
On Tue, Feb 08, 2022 at 06:53:50AM +0100, Christoph Anton Mitterer via 
austin-group-l at The Open Group wrote:
> Hey.
> 
> I'm afraid but some more questions came up on my side:
> 
> 
> 1) POSIX says:
> "The encoded values associated with <period>, <slash>, <newline>, and
> <carriage-return> shall be invariant across all locales supported by
> the implementation."
> 
> When now, for example, <period> is encoded as the byte 0x2E ... the
> consequence would be that it had to be 0x2E in all locales and their
> encodings, right?

Yes. And another fallout of that requirement: you cannot have a single
POSIX system supporting both ASCII and EBCDIC locales.  You can have
iconv and dd support for converting files between the two encodings,
but only one of those two encodings can match your current locale (all
syscalls, all filenames, and so forth, are tied to the current
encoding in use by the POSIX locale, whether that encoding be ASCII,
EBCDIC, or something else).  Any means for choosing which of those two
encodings is treated as the basis of the POSIX locale when starting a
subtree of processes that interact as a POSIX environment would be
vendor-specific interfaces outside of POSIX.

> 
> Doesn't that also mean that POSIX effectively forbids UTF16 or UTF32
> and actually any >1-byte fixed-encoding?
> Cause there it would have to be "padded" with 0x00?

Correct - a POSIX environment cannot use UTF16 or UTF32 encodings as
its basis.  Again, iconv and wide-character library calls (such as
wprintf) can support conversion of files into and out of those
encodings, but that is only file contents; all file names, syscalls,
and other aspects of the POSIX environment for cross-process
communication outside of file contents will use multi-byte encodings
where no multi-byte sequence has an embedded 0x00 byte, and NOT wide
character sequences that would represent UTF16 or UTF32 characters
directly.
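
To make the embedded-NUL point concrete, here is a minimal sketch that
shows which bytes <period> turns into once recoded to UTF-16.  It assumes
an ASCII-based system whose iconv accepts the codeset names UTF-8 and
UTF-16LE (those names are implementation-defined); the helper name
utf16_bytes is made up for illustration.

```shell
# Hypothetical helper: print the bytes of "." after recoding to UTF-16LE,
# as lowercase hex with no separators.
# Assumes iconv knows the codeset names UTF-8 and UTF-16LE.
utf16_bytes() {
    printf . | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n'
}

utf16_bytes
echo
```

On such a system the second byte is a zero byte, which is exactly what
cannot appear inside a file name or a multi-byte character of a POSIX
locale.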

> 2) When I have a shell script in some encoding, and it contains e.g.:
>   printf '.'
> would POSIX demand that this:
> a) always cause the byte 0x2E to be printed

POSIX states that <period> will be printed.  If that is the byte
0x2E, then your POSIX locale is probably ASCII-based.  But it is also
possible to have a POSIX conforming environment where the POSIX locale
is EBCDIC based, in which case it would print byte 0x4B, but that
would still be <period> for all file names and syscalls observable
from that POSIX environment.
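
If you want to see which case applies on a given system, something along
these lines should do.  This is only a sketch; the helper name period_byte
is invented here, and the two hex values are the ones mentioned above.

```shell
# Hypothetical helper: hex value of the byte(s) that printf emits for ".".
period_byte() {
    printf . | od -An -tx1 | tr -d ' \n'
}

case $(period_byte) in
    2e) echo 'period is 0x2E: ASCII-based encoding' ;;
    4b) echo 'period is 0x4B: EBCDIC-based encoding' ;;
    *)  echo 'period has some other encoding' ;;
esac
```

Whatever it prints, the result is the same in every locale on that system,
because the encoding of <period> is invariant across locales.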

> b) print the character 'x' according to the currently set locale, e.g.
>    if that was using UTF16, it would print the bytes 0x2e 0x00

It is not possible to have a POSIX locale based on the UTF16 encoding.
So this answer is not possible.  While you can write a file with
characters encoded in UTF16, which when recoded to a multibyte locale
form a shell script, it is only after you use iconv or fscanf or
similar to perform that encoding conversion before it actually becomes
a shell script (since sh is documented as being able to reject files
containing NUL bytes as not being a shell script).  POSIX does not
allow you to execute a file encoded in UTF16 as a shell script.

> c) print the character 'x' according to the locale in which the shell
>    parses the script (but there again, if it was UTF16... the bytes
>    0x2e 0x00)

The shell is not required to parse UTF16, because the POSIX locale
cannot be based on UTF16.

> d) Would it in some weird encodings like IBM905 cause the byte 0x4B to
>    be printed?

If you are running on an IBM machine where the POSIX locale is based
on EBCDIC, then it will indeed print the byte 0x4B.  But it will still
be <period>, as detected by all other processes reached from that
POSIX environment (and that system will necessarily be unable to have
an ASCII or UTF8 encoding in any of its locales; you are back to
having to use an extension outside of POSIX if you want to start a new
subtree of processes based on an ASCII base encoding).

> 
> 3) With respect to the command substitution with trailing newlines
> question:
> 
> Because of (2) ... would it be in any way safer to e.g.
>   printf '\056'
> (octal for . in ASCII/etc.)
> and also strip that off... rather than using '.'?

Actually, it is less portable.  \056 is a particular byte value, but
unless you know your POSIX locale is ASCII-based, you don't know
whether that byte value is <period>, or some other character, and
there are some POSIX-feasible locales where some single-byte
characters (such as 'A') may also appear in a multibyte-character
sequence.
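
For reference, the sentinel approach with a literal '.' can be sketched
roughly as follows.  The function name capture and the variable names
captured and rc are made up for this example; it is one possible way to
write it, not the only one.

```shell
# Hypothetical helper: run a command and store its complete output,
# trailing newlines included, in the variable "captured".
capture() {
    # Append a literal "." so that command substitution has no trailing
    # newlines to strip, and keep the command's own exit status.
    captured=$("$@"; rc=$?; printf .; exit "$rc")
    rc=$?
    captured=${captured%.}   # remove only the sentinel
    return "$rc"
}

capture printf 'line\n\n'
printf '%s' "$captured" | od -c
```

The literal '.' is written in whatever encoding the script itself uses, so
it stays <period> on both ASCII-based and EBCDIC-based systems, which is
the reason to prefer it over '\056'.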

> 
> Especially also with respect to a hypothetical UTF16/32 locale?

There is no such locale.

> 
> 4) Doesn't strictly belong here, but maybe someone knows:
> On my Debian (=> glibc) I was trying this:
> /usr/share/i18n/charmaps$ zgrep "[xX]2[eEfF]" * | grep -Ev '[[:space:]](SOLIDUS|FULL STOP)$'
> 
> i.e. searching for any entries that are 0x2E or 0x2f ( . and / ),
> filtering out any who really are considered as that.
> 
> That gave quite some matches:
> BRF.gz: /x2e BRAILLE PATTERN DOTS-46
> BRF.gz: /x2f BRAILLE PA

[Issue 8 drafts 0001561]: clarify what kind of data shell variables need to be able to hold

2022-02-08 Thread Austin Group Bug Tracker via austin-group-l at The Open Group


A NOTE has been added to this issue. 
== 
https://www.austingroupbugs.net/view.php?id=1561 
== 
Reported By:                calestyo
Assigned To:
==
Project:                    Issue 8 drafts
Issue ID:                   1561
Category:                   Shell and Utilities
Type:                       Enhancement Request
Severity:                   Editorial
Priority:                   normal
Status:                     New
Name:                       Christoph Anton Mitterer
Organization:
User Reference:
Section:                    various
Page Number:                N/A
Line Number:                N/A
Final Accepted Text:
==
Date Submitted:             2022-02-01 00:10 UTC
Last Modified:              2022-02-08 15:14 UTC
==
Summary:                    clarify what kind of data shell variables need to be able to hold
== 

-- 
 (0005668) calestyo (reporter) - 2022-02-08 15:14
 https://www.austingroupbugs.net/view.php?id=1561#c5668 
-- 
Not sure whether that's relevant, but 2.12. Shell Execution Environment
says:

"Variables with the export attribute, along with those explicitly exported
for the duration of the command, shall be passed to the utility environment
variables"

Doesn't that kinda rule out, that the shell may pass on any variables in
its own environment that haven't had a valid Name (in the sense of POSIX)
to any executed programs?


- "Variables with the export attribute" => those are all shell variables
- "along with those explicitly exported for the duration of the command" =>
those are the variable assignments on the command.

- it doesn't rule out explicitly that no others "shall be passed" on... but
one could implicitly deduce that (as those environment variables with
non-Names are already part of POSIX and not just some vendor extension...
so why should POSIX not list them as to be exported or at least call it
"unspecified" ... if it were so?) 

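A small sketch of the two cases the quoted sentence does cover (variables
with the export attribute, and assignments made for the duration of one
command).  The DEMO_* names are invented for this example.  It says
nothing about environment entries with invalid names, which is the open
question here.

```shell
unset DEMO_PLAIN DEMO_EXPORTED DEMO_ONESHOT
DEMO_PLAIN=plain            # shell variable without the export attribute
DEMO_EXPORTED=exported
export DEMO_EXPORTED        # shell variable with the export attribute

# DEMO_ONESHOT is exported for the duration of this one command only.
seen=$(DEMO_ONESHOT=oneshot sh -c \
    'printf "%s|%s|%s" "${DEMO_PLAIN-unset}" "${DEMO_EXPORTED-unset}" "${DEMO_ONESHOT-unset}"')

printf '%s\n' "$seen"       # prints: unset|exported|oneshot
```
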
Issue History 
Date Modified    Username       Field                    Change
==
2022-02-01 00:10 calestyo       New Issue
2022-02-01 00:10 calestyo       Name                      => Christoph Anton Mitterer
2022-02-01 00:10 calestyo       Section                   => various
2022-02-01 00:10 calestyo       Page Number               => N/A
2022-02-01 00:10 calestyo       Line Number               => N/A
2022-02-01 19:33 mirabilos      Note Added: 0005645
2022-02-01 19:44 calestyo       Note Added: 0005647
2022-02-01 20:52 chet_ramey     Note Added: 0005649
2022-02-01 23:07 kre            Note Added: 0005650
2022-02-02 15:15 chet_ramey     Note Added: 0005652
2022-02-02 16:39 calestyo       Note Added: 0005653
2022-02-02 18:44 kre            Note Added: 0005654
2022-02-06 11:18 mirabilos      Note Added: 0005662
2022-02-06 18:18 chet_ramey     Note Added: 0005665
2022-02-06 23:17 kre            Note Added: 0005666
2022-02-08 15:14 calestyo       Note Added: 0005668
==