Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-12-03 Thread Chet Ramey
On 11/24/18 4:32 PM, Bize Ma wrote:

> > Bash is removing characters not explicitly listed in a bracket
> > expression (character range).
> > In this example, it is removing digits from other languages.
> 
> What is your locale?
> 
>  
> The locale used was en_US.utf-8 but also happens with  459
> locales out of 868 available under Debian (not in C, for example).
> 
> Also in all locales affected (except one), setting either
> LC_ALL=$loc or LC_COLLATE=$loc did the same.
> Except in zh_CN.gb18030
> 
> But IMO locale collation should not be used for an explicit list.

Collation order is used for each individual character in a bracket
expression when compared against the string, as POSIX specifies.

> I have been made aware that there is a
>   cstart = cend = FOLD (cstart);
> inside the `sm_loop.c` file that will convert into a range many
> individual character. If that understanding is correct that is the
> source of the difference with other shells.

I'm not sure what you mean by "convert into a range." If cstart and cend
were treated as a range, the start and end characters would be the
same. If cstart == cend, a character that collates >= cstart and <= cend
would have to collate equal to cstart and cend.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/




Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-12-03 Thread Chet Ramey
On 11/28/18 2:05 AM, Bize Ma wrote:
> Chet Ramey <chet.ra...@case.edu> wrote:
>
> > I can't reproduce this:
>
> If you could take a look at https://unix.stackexchange.com/a/483835/265604
> you will see that it has been confirmed on "Ubuntu 17.10 (glibc 2.26) and on
> Ubuntu 18.04 (glibc 2.27), but it seems to be fixed on Ubuntu 18.10 (glibc
> 2.28)"

I must have used systems without this problem.

> It is interesting that (finally) glibc 2.28 has added a fourth sort key
> equal to the Unicode code point. That forces the order of all characters
> to be unique.

One of the POSIX future directions.





Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-12-03 Thread Chet Ramey
On 11/28/18 2:45 AM, Bize Ma wrote:
> Chet Ramey <chet.ra...@case.edu> wrote:
> 
> On 11/24/18 2:32 PM, Chet Ramey wrote:
> 
> >> But IMO locale collation should not be used for an explicit list.
> >
> > Collation order is used for each individual character in a bracket
> > expression when compared against the string, as posix specifies.
> 
> 
> Yes, values resulting from a glob expansion should be compared with strcoll.
> 
> How many characters should there be in a range like [0-0] ?
> Or to be more precise: in a [0] bracket expression? one?

There should be one character ("0") that matches as many characters as
collate equal to the character "0", as per the POSIX quote in my previous
message.

> 
> If I were you, I would file a bug report with Debian against wcscoll.
> 
> 
> And I would be told that wcscoll is doing what the collation file 14651 is
> telling it to do.

Sure.

> 
> And that, in any case, that file has been updated in glibc 2.28.

That should fix the problem without forcing applications to attempt to
impose a total ordering even when strcoll/wcscoll returns 0.

> > It returns 0 (equal) for L"٠" and L"0" without setting errno. That's
> > clearly a problem with wcscoll (if the character isn't valid in the
> > current locale) or the locale definition.
> 
> 
> Both characters collate to the same position as I have already explained.

Yes, so the locale definition files imposing a total ordering will be a
clear improvement.

> 
> I don't follow what you mean by "(if the character isn't valid in the
> current locale)".

There are codepoints that correspond to characters in one locale but don't
map to a valid character in another.

Chet




Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-12-03 Thread Bize Ma
Chet Ramey wrote:


> I can't reproduce this:
>

If you could take a look at https://unix.stackexchange.com/a/483835/265604
you will see that it has been confirmed on "Ubuntu 17.10 (glibc 2.26) and on
Ubuntu 18.04 (glibc 2.27), but it seems to be fixed on Ubuntu 18.10 (glibc
2.28)"

It is interesting that (finally) glibc 2.28 has added a fourth sort key
equal to the Unicode code point. That forces the order of all characters
to be unique.


Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-12-03 Thread Chet Ramey
On 11/28/18 2:29 AM, Bize Ma wrote:
> Chet Ramey <chet.ra...@case.edu> wrote:
> 
> On 11/24/18 4:32 PM, Bize Ma wrote:
> 
>  [...]
> 
> > I have been made aware that there is a
> >   cstart = cend = FOLD (cstart);
> > inside the `sm_loop.c` file that will convert into a range many
> > individual character. If that understanding is correct that is the
> > source of the difference with other shells.
> 
> I'm not sure what you mean by "convert into a range." If cstart and cend
> were treated as a range, the start and end characters would be the
> same. If cstart == cend, a character that collates >= cstart and <= cend
> would have to collate equal to cstart and cend.
> 
> 
> Yes, exactly, a range where the start and the end are the same.

A range like that is exactly equivalent to a single ordinary character.

POSIX: "An ordinary character in the list should only match that character,
but may match any single character that collates equally with that
character"
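
A rough way to check which of the two behaviours a given system shows is to match one character against a one-character bracket expression under a chosen locale. This is only a sketch: `bracket_matches` is a hypothetical helper name, not part of Bash, and the result for a UTF-8 locale depends on the installed collation data, so no result is claimed for that case.

```shell
# bracket_matches LOCALE LISTED CANDIDATE
# Succeeds if the bracket expression [LISTED] matches CANDIDATE when
# LC_ALL is set to LOCALE. Hypothetical helper, for illustration only.
bracket_matches() (
    # The function body is a subshell, so the locale change stays local.
    LC_ALL=$1
    case $3 in
        [$2]) exit 0 ;;
        *)    exit 1 ;;
    esac
)

# In the C locale only the listed character itself should match.
if bracket_matches C 0 0; then echo "C: [0] matches 0"; fi
if ! bracket_matches C 0 ٠; then echo "C: [0] does not match U+0660"; fi
```

To probe a UTF-8 locale, pass its name instead of C, for example `bracket_matches en_US.UTF-8 0 ٠`. Going by the reports in this thread, that call would be expected to succeed with glibc 2.26 or 2.27 and to fail with glibc 2.28.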





Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-12-03 Thread Bize Ma
Chet Ramey wrote:

> On 11/24/18 2:32 PM, Chet Ramey wrote:
>
> >> But IMO locale collation should not be used for an explicit list.
> >
> > Collation order is used for each individual character in a bracket
> > expression when compared against the string, as posix specifies.
>

Yes, values resulting from a glob expansion should be compared with strcoll.

How many characters should there be in a range like [0-0]?
Or, to be more precise, in a [0] bracket expression? One?

> If I were you, I would file a bug report with Debian against wcscoll.
>

And I would be told that wcscoll is doing what the ISO 14651 collation file is
telling it to do.

And that, in any case, that file has been updated in glibc 2.28.


> It returns 0 (equal) for L"٠" and L"0" without setting errno. That's
> clearly a problem with wcscoll (if the character isn't valid in the current
> locale) or the locale definition.
>

Both characters collate to the same position as I have already explained.

I don't follow what you mean by:
*(if the character isn't valid in the current locale).*


Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-12-03 Thread Bize Ma
Chet Ramey wrote:

> On 11/24/18 4:32 PM, Bize Ma wrote:

 [...]

> > I have been made aware that there is a
> >   cstart = cend = FOLD (cstart);
> > inside the `sm_loop.c` file that will convert into a range many
> > individual character. If that understanding is correct that is the
> > source of the difference with other shells.
>
> I'm not sure what you mean by "convert into a range." If cstart and cend
> were treated as a range, the start and end characters would be the
> same. If cstart == cend, a character that collates >= cstart and <= cend
> would have to collate equal to cstart and cend.
>

Yes, exactly, a range where the start and the end are the same.

Try:

$ touch 0 1 ٠ ١  ۰ ۱ ߀ ߁ ० १
$ echo [1]
1  ١

It is converted to the same range as this:

$ echo [1-1]
1  ١

That happens because, up to glibc 2.27, this has been the collation order of
those characters (search in /usr/share/i18n/locales/iso14651_t1_common):

 <0>;;;IGNORE
 <0>;;;IGNORE

Both collate to exactly the same values. This breaks the ability to detect
that a character is absent from a list sorted in collation order.
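
The effect of two characters sharing a collation position can also be observed outside Bash. The sketch below relies on `sort -u`, which in GNU coreutils (as far as I know) treats lines that compare equal under the locale's collation as duplicates. `count_distinct` is a hypothetical helper name, and only the C-locale case is exercised here because the other results depend on the installed glibc.

```shell
# count_distinct LOCALE WORD...
# Prints how many of the WORDs remain after `sort -u` under LOCALE, that
# is, how many distinct collation positions they occupy.
# Hypothetical helper, for illustration only.
count_distinct() (
    # Subshell body: the exported locale does not leak to the caller.
    LC_ALL=$1
    export LC_ALL
    shift
    printf '%s\n' "$@" | sort -u | wc -l | tr -d ' '
)

# In the C locale the comparison is by byte value, so the ASCII digit and
# the Arabic-Indic digit are kept apart.
count_distinct C 1 ١
```

With a UTF-8 locale and the pre-2.28 collation data described above, the same call would be expected to report a single line, since both digits share one collation position.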


Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-12-03 Thread Chet Ramey
On 11/24/18 2:32 PM, Chet Ramey wrote:

>> But IMO locale collation should not be used for an explicit list.
> 
> Collation order is used for each individual character in a bracket
> expression when compared against the string, as posix specifies.
> 
>> I have been made aware that there is a
>>   cstart = cend = FOLD (cstart);
>> inside the `sm_loop.c` file that will convert into a range many
>> individual character. If that understanding is correct that is the
>> source of the difference with other shells.
> 
> I'm not sure what you mean by "convert into a range." If cstart and cend
> were treated as a range, the start and end characters would be the
> same. If cstart == cend, a character that collates >= cstart and <= cend
> would have to collate equal to cstart and cend.

If I were you, I would file a bug report with Debian against wcscoll.

It returns 0 (equal) for L"٠" and L"0" without setting errno. That's
clearly a problem with wcscoll (if the character isn't valid in the current
locale) or the locale definition.





Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-12-01 Thread Bize Ma
Chet Ramey wrote:


> I am going to forward the rest of our exchange to bug-bash;


Great!


> you left the mailing list off your replies for some reason.
>
> Chet
>

Since it was *you* who directed the answer only to me, I assumed
(now seemingly incorrectly) that that was what you wanted to do.


Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-11-24 Thread Bize Ma
Chet Ramey wrote:

> On 11/23/18 6:09 PM, Bize Ma wrote:
>
> > Bash Version: 4.4
> > Patch Level: 12
> > Release Status: release
>


> > Description:
> >
> > Bash is removing characters not explicitly listed in a bracket
> > expression (character range).
> > In this example, it is removing digits from other languages.
>
> What is your locale?
>
>
The locale used was en_US.utf-8, but it also happens with 459 of the
868 locales available under Debian (not in C, for example).

Also, in all affected locales except one (zh_CN.gb18030), setting either
LC_ALL=$loc or LC_COLLATE=$loc did the same.
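
A count like the one quoted above could be gathered with a loop over the output of `locale -a`. The following is only a sketch of that idea, not the script actually used for those figures; `count_affected_locales` is a hypothetical name, and the loop assumes the script text itself is UTF-8, so results for locales with other character sets are not meaningful.

```shell
# count_affected_locales
# Prints "affected/total": the number of installed locales in which the
# bracket expression [0] also matches ARABIC-INDIC DIGIT ZERO (U+0660),
# and the number of locales tried. Hypothetical helper; it sets the
# global variables affected, total and loc.
count_affected_locales() {
    affected=0
    total=0
    for loc in $(locale -a 2>/dev/null); do
        total=$((total + 1))
        # Subshell so each locale change is discarded after the test.
        if ( LC_ALL=$loc
             case ٠ in [0]) exit 0 ;; *) exit 1 ;; esac ) 2>/dev/null
        then
            affected=$((affected + 1))
        fi
    done
    printf '%s/%s\n' "$affected" "$total"
}

count_affected_locales
```

Replacing `LC_ALL=$loc` with `LC_COLLATE=$loc` would test the other variable mentioned above, provided LC_ALL is not already set in the environment.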

But IMO locale collation should not be used for an explicit list.

I have been made aware that there is a
  cstart = cend = FOLD (cstart);
inside the `sm_loop.c` file that will convert many individual
characters into a range. If that understanding is correct, that is the
source of the difference from other shells.

My understanding is that a collation table *must have a "total order"*,
in fact a strict total order. If two characters `a` and `b` could sort as
equal, the order fails to confirm that a character is absent from the
list. Consider characters `a`, `b` and `c`: if `a` and `b` sort as equal,
a sorted list in which we find `a` followed by `c` doesn't confirm that
`b` is absent, as the order could well be `b a c`.

In this case, there must not be any other character than `a` in the
range `a-a` and using a range `a-a` is equivalent (just slower and
more complex) to the single character `a`.

If this is not the case, the error is in the collation table, not in using
single (faster) characters. And what should be updated is that
collation table, IMO.


Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-11-24 Thread Chet Ramey
On 11/23/18 6:09 PM, Bize Ma wrote:

> Bash Version: 4.4
> Patch Level: 12
> Release Status: release
> 
> 
> 
> Description:
> 
> Bash is removing characters not explicitly listed in a bracket
> expression (character range).
> In this example, it is removing digits from other languages.
> 
> Also tested (and it fails) in bash 3.{0,1,3} 4.{1,2,3} and 5.0
> Not a problem in bash 2.{0,1}

I can't reproduce this:

$ cat ./x4

a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'

recho "${a}"
recho "${a//[0123456789]}"
$ ../bash-5.0-beta2/bash ./x4
argv[1] = <0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९>
argv[1] = < ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९>
$ ../bash-4.4-patched/bash ./x4
argv[1] = <0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९>
argv[1] = < ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९>





Re: Bash removes unrequested characters in bracket expressions (not a range).

2018-11-24 Thread Chet Ramey
On 11/23/18 6:09 PM, Bize Ma wrote:

> Bash Version: 4.4
> Patch Level: 12
> Release Status: release
> 
> 
> 
> Description:
> 
> Bash is removing characters not explicitly listed in a bracket
> expression (character range).
> In this example, it is removing digits from other languages.

What is your locale?




Bash removes unrequested characters in bracket expressions (not a range).

2018-11-23 Thread Bize Ma
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: Linux-gnu
Compiler: gcc
Compilation CFLAGS:  -DPROGRAM='bash' -DCONF_HOSTTYPE='x86_64'
-DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='x86_64-pc-linux-gnu'
-DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash'
-DSHELL -DHAVE_CONFIG_H   -I.  -I../. -I.././include -I.././lib
-Wdate-time -D_FORTIFY_SOURCE=2 -g -O2
-fdebug-prefix-map=/build/bash-7fckc0/bash-4.4=.
-fstack-protector-strong -Wformat -Werror=format-security -Wall
-no-pie -Wno-parentheses -Wno-format-security
uname output: Linux io 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2
(2018-10-27) x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 4.4
Patch Level: 12
Release Status: release



Description:

Bash is removing characters not explicitly listed in a bracket
expression (character range).
In this example, it is removing digits from other languages.

Also tested (and it fails) in bash 3.{0,1,3} 4.{1,2,3} and 5.0
Not a problem in bash 2.{0,1}



Repeat-By:

If the characters do not display correctly, please visit:
https://unix.stackexchange.com/q/483743/265604

$ a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'
$ echo "${a//[0123456789]}"
  ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९
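
Since the C locale does not show the problem, one possible workaround is to perform the substitution with the locale forced to C. This is a minimal sketch, assuming Bash and assuming that only ASCII digits need to be removed; `strip_ascii_digits` is a hypothetical name, not an existing command.

```shell
# strip_ascii_digits STRING
# Prints STRING with the ASCII digits 0-9 removed. The pattern match is
# done in the C locale, where no other digit collates equal to an ASCII
# one. Requires Bash for the ${var//pattern} expansion.
strip_ascii_digits() {
    local LC_ALL=C
    printf '%s\n' "${1//[0123456789]}"
}

a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'
strip_ascii_digits "$a"
```

The non-ASCII digits survive because, in UTF-8, every byte of a multibyte character lies outside the ASCII range, so none of them can be mistaken for one of the listed digits.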