Re: Bash removes unrequested characters in bracket expressions (not a range).
On 11/24/18 4:32 PM, Bize Ma wrote:

> > Bash is removing characters not explicitly listed in a bracket
> > expression (character range).
> > In this example, it is removing digits from other languages.
> >
> > What is your locale?
>
> The locale used was en_US.utf-8 but also happens with 459
> locales out of 868 available under Debian (not in C, for example).
>
> Also in all locales affected (except one), setting either
> LC_ALL=$loc or LC_COLLATE=$loc did the same.
> Except in zh_CN.gb18030
>
> But IMO locale collation should not be used for an explicit list.

Collation order is used for each individual character in a bracket
expression when compared against the string, as POSIX specifies.

> I have been made aware that there is a
>     cstart = cend = FOLD (cstart);
> inside the `sm_loop.c` file that will convert into a range many
> individual characters. If that understanding is correct that is the
> source of the difference with other shells.

I'm not sure what you mean by "convert into a range." If cstart and cend
were treated as a range, the start and end characters would be the
same. If cstart == cend, a character that collates >= cstart and <= cend
would have to collate equal to cstart and cend.

-- 
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: Bash removes unrequested characters in bracket expressions (not a range).
On 11/28/18 2:05 AM, Bize Ma wrote:
> Chet Ramey <chet.ra...@case.edu> wrote:
>
> > I can't reproduce this:
>
> If you could take a look at https://unix.stackexchange.com/a/483835/265604
> you will see that it has been confirmed on "Ubuntu 17.10 (glibc 2.26) and on
> Ubuntu 18.04 (glibc 2.27), but it seems to be fixed on Ubuntu 18.10 (glibc
> 2.28)"

I must have used systems without this problem.

> It is interesting that (finally) glibc 2.28 has added a fourth sort key
> equal to the Unicode code point. That forces the order of all characters
> to be unique.

One of the POSIX future directions.

-- 
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: Bash removes unrequested characters in bracket expressions (not a range).
On 11/28/18 2:45 AM, Bize Ma wrote:
> Chet Ramey <chet.ra...@case.edu> wrote:
>
> > On 11/24/18 2:32 PM, Chet Ramey wrote:
> >
> > >> But IMO locale collation should not be used for an explicit list.
> > >
> > > Collation order is used for each individual character in a bracket
> > > expression when compared against the string, as POSIX specifies.
>
> Yes, values resulting from a glob expansion should be compared with strcoll.
>
> How many characters should there be in a range like [0-0] ?
> Or to be more precise: in a [0] bracket expression? one?

There should be one character ("0") that matches as many characters as
collate equal to the character "0", as per the POSIX quote in my previous
message.

> > If I were you, I would file a bug report with Debian against wcscoll.
>
> And I would be told that wcscoll is doing what the collation file 14651 is
> telling it to do.

Sure.

> And, that in any case, that file has been updated in glibc 2.28 anyway.

That should fix the problem without forcing applications to attempt to
impose a total ordering even when strcoll/wcscoll returns 0.

> > It returns 0 (equal) for L"٠" and L"0" without setting errno. That's
> > clearly a problem with wcscoll (if the character isn't valid in the
> > current locale) or the locale definition.
>
> Both characters collate to the same position as I have already explained.

Yes, so the locale definition files imposing a total ordering will be a
clear improvement.

> I don't follow you about what you mean with: /(if the character isn't valid
> in the current locale)./

There are codepoints that correspond to characters in one locale but don't
map to a valid character in another.

Chet

-- 
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: Bash removes unrequested characters in bracket expressions (not a range).
Chet Ramey wrote:

> I can't reproduce this:

If you could take a look at https://unix.stackexchange.com/a/483835/265604
you will see that it has been confirmed on "Ubuntu 17.10 (glibc 2.26) and on
Ubuntu 18.04 (glibc 2.27), but it seems to be fixed on Ubuntu 18.10 (glibc
2.28)"

It is interesting that (finally) glibc 2.28 has added a fourth sort key
equal to the Unicode code point. That forces the order of all characters
to be unique.
Re: Bash removes unrequested characters in bracket expressions (not a range).
On 11/28/18 2:29 AM, Bize Ma wrote:
> Chet Ramey <chet.ra...@case.edu> wrote:
>
> > On 11/24/18 4:32 PM, Bize Ma wrote:
> > [...]
> > > I have been made aware that there is a
> > >     cstart = cend = FOLD (cstart);
> > > inside the `sm_loop.c` file that will convert into a range many
> > > individual characters. If that understanding is correct that is the
> > > source of the difference with other shells.
> >
> > I'm not sure what you mean by "convert into a range." If cstart and cend
> > were treated as a range, the start and end characters would be the
> > same. If cstart == cend, a character that collates >= cstart and <= cend
> > would have to collate equal to cstart and cend.
>
> Yes, exactly, a range where the start and the end are the same.

A range like that is exactly equivalent to a single ordinary character.

POSIX:

"An ordinary character in the list should only match that character, but
may match any single character that collates equally with that character"

-- 
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: Bash removes unrequested characters in bracket expressions (not a range).
Chet Ramey wrote:

> On 11/24/18 2:32 PM, Chet Ramey wrote:
>
> >> But IMO locale collation should not be used for an explicit list.
> >
> > Collation order is used for each individual character in a bracket
> > expression when compared against the string, as POSIX specifies.

Yes, values resulting from a glob expansion should be compared with strcoll.

How many characters should there be in a range like [0-0] ?
Or to be more precise: in a [0] bracket expression? one?

> If I were you, I would file a bug report with Debian against wcscoll.

And I would be told that wcscoll is doing what the collation file 14651 is
telling it to do.

And, that in any case, that file has been updated in glibc 2.28 anyway.

> It returns 0 (equal) for L"٠" and L"0" without setting errno. That's
> clearly a problem with wcscoll (if the character isn't valid in the current
> locale) or the locale definition.

Both characters collate to the same position as I have already explained.

I don't follow you about what you mean with:
*(if the character isn't valid in the current locale).*
Re: Bash removes unrequested characters in bracket expressions (not a range).
Chet Ramey wrote:

> On 11/24/18 4:32 PM, Bize Ma wrote:
> [...]
> > I have been made aware that there is a
> >     cstart = cend = FOLD (cstart);
> > inside the `sm_loop.c` file that will convert into a range many
> > individual characters. If that understanding is correct that is the
> > source of the difference with other shells.
>
> I'm not sure what you mean by "convert into a range." If cstart and cend
> were treated as a range, the start and end characters would be the
> same. If cstart == cend, a character that collates >= cstart and <= cend
> would have to collate equal to cstart and cend.

Yes, exactly, a range where the start and the end are the same.

Try:

    $ touch 0 1 ٠ ١ ۰ ۱ ߀ ߁ ० १
    $ echo [1]
    1 ١

It is converted to the same range as this:

    $ echo [1-1]
    1 ١

That happens because up to glibc 2.27 this has been the collation order of
those characters (search in /usr/share/i18n/locales/iso14651_t1_common):

    <0>;;;IGNORE
    <0>;;;IGNORE

They collate to exactly the same values.

This breaks the capacity to detect that a character is absent in a list
ordered by the collation order.
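The claim that two characters "collate to exactly the same values" can be probed from the shell itself. The following is a minimal sketch, not something taken from the thread: `collates_equal` is a hypothetical helper name, and it relies on the documented behaviour that `<` and `>` inside `[[ ]]` sort using the current locale. It goes through bash's string comparison rather than the pattern matcher, so it is only an approximation of what the bracket-expression code sees.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the thread): succeed when neither string
# sorts before the other under the current locale's collation rules.
collates_equal() {
    [[ ! "$1" < "$2" && ! "$1" > "$2" ]]
}

# Compare ASCII "1" with ARABIC-INDIC DIGIT ONE in whatever locale is active.
# The answer depends on the locale and on the installed glibc version.
if collates_equal 1 ١; then
    echo "1 and ١ collate equal in this locale"
else
    echo "1 and ١ collate differently in this locale"
fi
```

Running it once under an affected UTF-8 locale and once with `LC_ALL=C` should show whether the collation data, rather than the shell, is what makes the two digits indistinguishable.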
Re: Bash removes unrequested characters in bracket expressions (not a range).
On 11/24/18 2:32 PM, Chet Ramey wrote:

>> But IMO locale collation should not be used for an explicit list.
>
> Collation order is used for each individual character in a bracket
> expression when compared against the string, as POSIX specifies.
>
>> I have been made aware that there is a
>>     cstart = cend = FOLD (cstart);
>> inside the `sm_loop.c` file that will convert into a range many
>> individual characters. If that understanding is correct that is the
>> source of the difference with other shells.
>
> I'm not sure what you mean by "convert into a range." If cstart and cend
> were treated as a range, the start and end characters would be the
> same. If cstart == cend, a character that collates >= cstart and <= cend
> would have to collate equal to cstart and cend.

If I were you, I would file a bug report with Debian against wcscoll.
It returns 0 (equal) for L"٠" and L"0" without setting errno. That's
clearly a problem with wcscoll (if the character isn't valid in the current
locale) or the locale definition.

-- 
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Re: Bash removes unrequested characters in bracket expressions (not a range).
Chet Ramey wrote:

> I am going to forward the rest of our exchange to bug-bash;

Great!

> you left the mailing list off your replies for some reason.
>
> Chet

Since it was *you* who directed the answer only to me, I assumed (now
seemingly incorrectly) that that was what you wanted to do.
Re: Bash removes unrequested characters in bracket expressions (not a range).
Chet Ramey wrote:

> On 11/23/18 6:09 PM, Bize Ma wrote:
>
> > Bash Version: 4.4
> > Patch Level: 12
> > Release Status: release
> >
> > Description:
> >
> > Bash is removing characters not explicitly listed in a bracket
> > expression (character range).
> > In this example, it is removing digits from other languages.
>
> What is your locale?

The locale used was en_US.utf-8, but it also happens with 459 locales out of
868 available under Debian (not in C, for example).

Also in all locales affected (except one), setting either LC_ALL=$loc or
LC_COLLATE=$loc did the same.
Except in zh_CN.gb18030

But IMO locale collation should not be used for an explicit list.

I have been made aware that there is a

    cstart = cend = FOLD (cstart);

inside the `sm_loop.c` file that will convert into a range many individual
characters. If that understanding is correct that is the source of the
difference with other shells.

I have the perception that a collation table *must have a "total order"*,
in fact, a strict total order. If two characters `a` and `b` could sort as
equal, the order will fail to provide a confirmation that a character is
absent from the list. Consider characters `a`, `b` and `c`: if `a` and `b`
sort as equal, a sorted list in which we find `a` followed by `c` doesn't
confirm that `b` is absent, as the order could well be `b a c`.

In this case, there must not be any other character than `a` in the range
`a-a`, and using a range `a-a` is equivalent (just slower and more complex)
to the single character `a`.

If this is not the case, the error is in the collation table, not in using
single (faster) characters. And what should be updated is such collation
table IMO.
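The figure of affected locales quoted in this message could be re-checked on another system with a loop over `locale -a`. What follows is only a sketch under stated assumptions: `locale_is_affected` is a hypothetical name, the probe uses a single pair of characters (ASCII `0` and ARABIC-INDIC DIGIT ZERO), and it is not the procedure the poster actually used, which the thread does not show.

```shell
#!/usr/bin/env bash
# Hypothetical check (not the poster's method): under the given locale, does
# the one-character bracket expression [0] also remove ARABIC-INDIC DIGIT ZERO?
# The body runs in a subshell so the locale change does not leak out.
locale_is_affected() (
    LC_ALL=$1
    sample='0٠'
    # If only the ASCII zero is removed, the remainder is exactly '٠'.
    [[ "${sample//[0]/}" != '٠' ]]
) 2>/dev/null

affected=0
total=0
while IFS= read -r loc; do
    total=$((total + 1))
    if locale_is_affected "$loc"; then
        affected=$((affected + 1))
    fi
done < <(locale -a 2>/dev/null)

printf '%d of %d locales affected\n' "$affected" "$total"
```

The counts will differ between systems, since they depend on which locales are installed and on the glibc version providing the collation data.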
Re: Bash removes unrequested characters in bracket expressions (not a range).
On 11/23/18 6:09 PM, Bize Ma wrote:
> Bash Version: 4.4
> Patch Level: 12
> Release Status: release
>
> Description:
>
> Bash is removing characters not explicitly listed in a bracket
> expression (character range).
> In this example, it is removing digits from other languages.
>
> Also tested (and it fails) in bash 3.{0,1,3} 4.{1,2,3} and 5.0
> Not a problem in bash 2.{0,1}

I can't reproduce this:

$ cat ./x4
a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'

recho "${a}"
recho "${a//[0123456789]}"
$ ../bash-5.0-beta2/bash ./x4
argv[1] = <0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९>
argv[1] = < ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९>
$ ../bash-4.4-patched/bash ./x4
argv[1] = <0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९>
argv[1] = < ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९>

-- 
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
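The `recho` command in the test script is a small helper built from the bash source tree, not a standard utility, so the script will not run as-is on most systems. Anyone wanting to repeat the test without a bash build tree could define a rough stand-in such as the one below. This is an approximation written for illustration: as far as I recall the real helper also makes control characters visible, which this sketch does not attempt.

```shell
#!/usr/bin/env bash
# Approximate stand-in for the bash test suite's recho helper:
# print each argument on its own line as argv[N] = <value>.
recho() {
    local i=1 arg
    for arg in "$@"; do
        printf 'argv[%d] = <%s>\n' "$i" "$arg"
        i=$((i + 1))
    done
}

a='0123456789 ٠١٢٣٤٥٦٧٨٩'
recho "${a}"
recho "${a//[0123456789]}"
```

With this function defined at the top of `x4`, the two `recho` lines show exactly which characters survive the expansion, including any leading spaces that `echo` output makes easy to miss.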
Re: Bash removes unrequested characters in bracket expressions (not a range).
On 11/23/18 6:09 PM, Bize Ma wrote:
> Bash Version: 4.4
> Patch Level: 12
> Release Status: release
>
> Description:
>
> Bash is removing characters not explicitly listed in a bracket
> expression (character range).
> In this example, it is removing digits from other languages.

What is your locale?

-- 
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
Bash removes unrequested characters in bracket expressions (not a range).
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: Linux-gnu
Compiler: gcc
Compilation CFLAGS: -DPROGRAM='bash' -DCONF_HOSTTYPE='x86_64' -DCONF_OSTYPE='linux-gnu' -DCONF_MACHTYPE='x86_64-pc-linux-gnu' -DCONF_VENDOR='pc' -DLOCALEDIR='/usr/share/locale' -DPACKAGE='bash' -DSHELL -DHAVE_CONFIG_H -I. -I../. -I.././include -I.././lib -Wdate-time -D_FORTIFY_SOURCE=2 -g -O2 -fdebug-prefix-map=/build/bash-7fckc0/bash-4.4=. -fstack-protector-strong -Wformat -Werror=format-security -Wall -no-pie -Wno-parentheses -Wno-format-security
uname output: Linux io 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
Machine Type: x86_64-pc-linux-gnu

Bash Version: 4.4
Patch Level: 12
Release Status: release

Description:
        Bash is removing characters not explicitly listed in a bracket
        expression (character range).
        In this example, it is removing digits from other languages.

        Also tested (and it fails) in bash 3.{0,1,3} 4.{1,2,3} and 5.0
        Not a problem in bash 2.{0,1}

Repeat-By:
        If the characters are a problem, please visit:
        https://unix.stackexchange.com/q/483743/265604

        $ a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'
        $ echo "${a//[0123456789]}"
        ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९
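Since the thread reports that the C locale does not show the problem, one possible workaround on affected systems is to perform the expansion with `LC_ALL=C`. The sketch below assumes that approach; `strip_ascii_digits` is a hypothetical name, not something proposed in the thread. In the C locale bash matches byte by byte, and every byte of a multibyte UTF-8 character lies outside the ASCII range, so non-ASCII digits pass through untouched.

```shell
#!/usr/bin/env bash
# Hypothetical helper: remove only the ASCII digits 0-9 from a string.
# The body runs in a subshell, so the caller's locale is left unchanged.
strip_ascii_digits() (
    LC_ALL=C
    printf '%s\n' "${1//[0123456789]/}"
)

a='0123456789 ٠١٢٣٤٥٦٧٨٩ ۰۱۲۳۴۵۶۷۸۹ ߀߁߂߃߄߅߆߇߈߉ ०१२३४५६७८९'
strip_ascii_digits "$a"
```

The trade-off is that the C locale disables all multibyte awareness for that expansion, so this only suits patterns made purely of ASCII characters.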