Re: bash variable names do not comply w/POSIX character set rules

2015-12-06 Thread Eduardo A . Bustamante López
This definition (
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_230
) states:

  3.230 Name

  In the shell command language, a word consisting solely of underscores,
  digits, and alphabetics from the portable character set. The first character
  of a name is not a digit.

This document has a table of the characters included in the portable character
set:

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tag_06_01

The alphabetics and digits in that table are a small subset of the Unicode
alphanumerics.

So no, it does not mandate arbitrary Unicode alphabetics. Only the ones listed
there.
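
A minimal sketch of that rule as a shell check, for illustration only. The
function name is_posix_name is made up here, and forcing the C locale is an
assumption about how to keep the bracket ranges limited to the letters and
digits of the portable character set:

```shell
# Hypothetical helper: succeed only if $1 is a "name" in the sense of
# XBD 3.230 (underscores, digits, and portable-character-set alphabetics,
# not starting with a digit).
# The body runs in a subshell, so the LC_ALL change does not leak out.
is_posix_name() (
    LC_ALL=C    # assumption: in the C locale the ranges below are ASCII only
    case $1 in
        '' | [0-9]* | *[!A-Za-z0-9_]*) exit 1 ;;  # empty, leading digit, or other char
        *) exit 0 ;;
    esac
)

for candidate in foo_1 1foo "$(printf 'a\303\230b')"; do
    if is_posix_name "$candidate"; then
        echo "accepted: $candidate"
    else
        echo "rejected: $candidate"
    fi
done
```

Here foo_1 is accepted, while 1foo and the UTF-8 bytes of "aØb" are rejected.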



Re: bash variable names do not comply w/POSIX character set rules

2015-12-06 Thread Linda Walsh



Eduardo A. Bustamante López wrote:

 This definition (
 http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html#tag_03_230
 ) states:

   3.230 Name

   In the shell command language, a word consisting solely of underscores,
   digits, and alphabetics from the portable character set. The first character
   of a name is not a digit.


  (1) -- It appears you /accidentally/ left out part of the text under
section 3.230.  The full text:


 3.230 Name

 In the shell command language, a word consisting solely of
 underscores, digits, and alphabetics from the portable character
 set. The first character of a name is not a digit.

 Note: The Portable Character Set is defined in detail in
 P̲o̲r̲t̲a̲b̲l̲e̲ ̲C̲h̲a̲r̲a̲c̲t̲e̲r̲ ̲S̲e̲t̲⁽¹⁾

[§̲⁽¹⁾ 
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap06.html#tag_06_01 
]

 3.231 ...[next section]



   Thank-you.  This slightly clarifies matters as it only
requires the POSIX source.  At the location pointed to by
the hyper-link for "Portable Character Set" under section 6.1
sentences 2-4, it states:


 Each supported locale shall include the portable character set,
 which is the set of symbolic names for characters in Portable
 Character Set. This is used to describe characters within the text
 of IEEE Std 1003.1-2001. The first eight entries in Portable
 Character Set are defined in the ISO/IEC 6429:1992 standard and
 the rest of the characters are defined in the ISO/IEC 10646-1:2000
 standard. 


   FWIW, in full disclosure, in the last dotted paragraph before the
last sentence of section 6.1, there is a requirement that the alphabetic
character fit within 1 byte -- i.e. only characters in what is commonly
called the "Extended ASCII character set" (e.g. ISO-8859-1) seem to be
required.  Note, the character 'Ø' is 1 byte in ISO-8859-1.  So, although
the section quoted above [basically] uses the Unicode table for "symbolic
names", it doesn't prescribe a specific encoding. I.e., while the
reference is to ISO-10646 (Unicode), it does not require a
specific encoding.


   For Unicode values 0-255, ISO-8859-1 encodes the first 256
code points of Unicode with 1 byte each, satisfying the 1-byte POSIX
constraint, though it is not able to encode Unicode values >=256. That makes
POSIX's reference to ISO-10646 somewhat specious, as only the first
256 values can be encoded in 1 byte (that I am aware of).
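
   To make the encoding point concrete, here is a small sketch that counts
the bytes needed for U+00D8 in each encoding. The octal escapes are the
ISO-8859-1 byte (0xD8) and the UTF-8 sequence (0xC3 0x98); the variable
names are only illustrative:

```shell
# Count the bytes used to encode 'Ø' (U+00D8) in two encodings.
# The $(( ... )) wrapper strips any padding that wc may print.
latin1_len=$(( $(printf '\330' | wc -c) ))      # ISO-8859-1: one byte, 0xD8
utf8_len=$(( $(printf '\303\230' | wc -c) ))    # UTF-8: two bytes, 0xC3 0x98

echo "ISO-8859-1: $latin1_len byte(s), UTF-8: $utf8_len byte(s)"
# prints: ISO-8859-1: 1 byte(s), UTF-8: 2 byte(s)
```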

   Nevertheless, the symbolic name "LATIN CAPITAL LETTER O WITH STROKE
(o slash)" or 'U+00D8' is classified as an alphabetic, which is a subset
of the "alphanumeric" requirement of POSIX. 


   Note under section 9.3.5 "RE Bracket Expression", subsection 6:


 The following character class expressions shall be supported in
 all locales:

 [:alnum:]   [:cntrl:]   [:lower:]   [:space:]
 [:alpha:]   [:digit:]   [:print:]   [:upper:]
 [:blank:]   [:graph:]   [:punct:]   [:xdigit:]

 In addition, character class expressions of the form:

 [:name:]

 are recognized in those locales where the name keyword has been
 given a charclass definition in the LC_CTYPE category.


Note that "aØb" is classified as fully "alphabetic" by bash's
character-class matching facility -- whether in UTF-8 or ISO-8859-1:


 echo $LC_CTYPE

en_US.ISO-8859-1
LC_CTYPE=en_US.UTF-8
...

 declare -xg a=$(echo -n $'\x61\xd8\x62')
 declare -xg b=${a}c
 [[ $a =~ ^[[:alpha:]]+$ ]] && echo alpha

alpha
 [[ $a =~ ^[[:alnum:]]+$ ]] && echo alnum   

alnum

 [[ $b =~ ^[[:alpha:]]+$ ]] && echo alpha

alpha

 [[ $b =~ ^[[:alnum:]]+$ ]] && echo alnum

alnum

Notice bash classifies every character of the string "aØb" as both
alphanumeric AND alphabetic.  I.e.  bash, itself, claims that
"aØb" is a valid identifier.

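One caveat with the matches above is that character-class results depend on
the locale in effect. The following is only a sketch, assuming a POSIX-style
shell; the helper name all_alpha_in_c_locale is hypothetical. In the C (POSIX)
locale, class alpha holds only the ASCII letters, so the same bytes no longer
count as fully alphabetic:

```shell
# Hypothetical helper: succeed only if every byte of $1 is in class
# alpha when the C locale is in effect.
# The body runs in a subshell, so the LC_ALL change does not leak out.
all_alpha_in_c_locale() (
    LC_ALL=C
    case $1 in
        '' | *[![:alpha:]]*) exit 1 ;;   # empty, or contains a non-alpha byte
        *) exit 0 ;;
    esac
)

a=$(printf 'a\330b')    # the ISO-8859-1 bytes of "aØb"
if all_alpha_in_c_locale "$a"; then
    echo "alpha in the C locale"
else
    echo "not alpha in the C locale"
fi
```
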
Also note, it accepts "aØb" as a var and as an environment var
when used indirectly:


 declare -xg $a='a"slash-O"b'
 declare -xg $b='ab"slash-O"c'
 env|/usr/bin/grep -P '^[ab]...?'|hexdump -C

0000  61 d8 62 63 3d 61 62 22  73 6c 61 73 68 2d 4f 22  |a.bc=ab"slash-O"|
0010  63 0a 61 d8 62 3d 61 22  73 6c 61 73 68 2d 4f 22  |c.a.b=a"slash-O"|
0020  62 0a 61 3d 61 d8 62 0a  62 3d 61 d8 62 63 0a     |b.a=a.b.b=a.bc.|
002f
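
For comparison, the environment itself is just a list of name=value strings,
so a name the shell will not assign directly can still be handed to a child
process. A small sketch, assuming the env utility accepts such a name (GNU
and BSD env do, as far as I know); the name a.b is only an example:

```shell
# Put a variable named "a.b" into the environment of a child process
# with env(1), have that child list its environment, and count the
# lines that are exactly "a.b=1".
count=$(env 'a.b=1' env | grep -c '^a\.b=1$')
echo "matching entries: $count"
```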

===



...
 So no, it does not mandate arbitrary unicode alphabetics. Only the ones listed
 there.


   Thank-you.  This better makes the case, as it only refers to
the POSIX reference pages.  But it seems that it boils down to the
allowed definition of environment variables:
(http://pubs.opengroup.org/onlinepubs/9699919799/)


 2.5.3 Shell Variables



 Variables shall be initialized from the environment (as defined by
 XBD Environment Variables and the exec function in the System
 Interfaces volume of POSIX.1-2008) and can be given new values
 with variable assignment commands.



The XBD interface is a description of API facilities for programs to use --
not an end-user-interface.  In particular, it says: (under section 8.1)

(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08)


 Environment variable names used by the utilities in the 

bash variable names do not comply w/POSIX character set rules

2015-12-05 Thread Linda Walsh




Under section 2.5.3, Shell Variables, it mentions:

LC_CTYPE
   Determine the interpretation of sequences of bytes of text data as 
characters (for example, single-byte as opposed to multi-byte 
characters), which characters are defined as letters (character class 
alpha) and <blank> characters (character class blank), and the behavior 
of character classes within pattern matching.


If I have LC_CTYPE set to a UTF-8 locale, then the rules in Unicode as
to how a character is classified (alpha, numeric, alphanumeric, etc...)
seem appropriate to use.

In the bash man page, there is a definition of 'name':
  name   A word consisting only of alphanumeric characters and underscores,
         and beginning with an alphabetic character or an underscore.
         Also referred to as an identifier.

However, I was looking for a char to visually separate
a "class" and a var in the class (would have liked something
like a.b, but "." isn't alphanumeric).
"LATIN CAPITAL LETTER O WITH STROKE" (U+00D8) is alphabetic,
but doesn't work:

 aØb=1

-bash: aØb=1: command not found

The POSIX portable character set:
6. Character Set
6.1 Portable Character Set

Conforming implementations shall support one or more coded character 
sets. Each supported locale shall include the portable character set, 
which is the set of symbolic names for characters in Portable Character 
Set. This is used to describe characters within the text of 
POSIX.1-2008. The first eight entries in Portable Character Set are 
defined in the ISO/IEC 6429:1992 standard and the rest of the characters 
are defined in the ISO/IEC 10646-1:2000 standard.


ISO 10646 = Unicode -- i.e. POSIX appears to base its definition of
alphanumeric characters, for example, on the Unicode character set.

So, theoretically, any alphanumeric class char from Unicode should work
as described in the bash manpages, to compose a "name" (variable or
subroutine name), but this doesn't seem to be the case.

I know this isn't a trivial POSIX requirement to meet, but given
GNU's and bash's changes in the shell and unix command behavior, it
seems support of the character set would be the foundation of POSIX
compatibility.

If it were me, I'd probably try to look at the Perl handling (imperfect
as it may be) for Unicode -- which has had a lot of work put into it and
may be one of the more complete and up-to-date implementations for Unicode
character handling.  I'd try to see if there was any part that might
either give ideas for bringing bash into compliance or any code that
might provide a pattern for implementation.  But investigating it further
might yield other, better options for bash.  Dunno.

Is this something that's even been thought about or is planned for?

Thanks!
-Linda