Re: Problem with Bash regex test case sensitivity
On 12/3/10, Eric Blake eblake@ wrote:
> On 12/03/2010 07:11 PM, Lee wrote:
>> Or, is this a bug?
>
> No, but a feature of your locale. Set 'export LC_COLLATE=C', and use
> LANG rather than LC_ALL for all your other locale defaults, in your
> ~/.bashrc if you don't like it.

Nice tip - thank you. But is there a reason I'd want LANG set to en_US.UTF-8 instead of C.UTF-8? As far as I can tell, everything works for me with LANG=C.UTF-8. Other than changing the collating sequence to something I don't want, what does LANG=en_US.UTF-8 get me that LANG=C.UTF-8 doesn't?

As long as I'm showing how ignorant I am... why put the locale defaults in ~/.bashrc? My understanding is that ~/.bashrc is called at every shell startup. Seems like that's one of those things that just needs to be set in the login shell, so wouldn't ~/.bash_profile be more appropriate for the locale settings?

>> Welcome to the new world order :-0 I tried to figure out why the
>> collating sequence changes with the language settings but didn't get
>> anywhere beyond the fact that it _does_ change. Oh well.. try, try again.
>
> Read the FAQ. http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.

Which says the en_US locale collates the upper and lower case letters like this: AaBb...Zz

I got that much :) What I don't get is why someone would _want_ the collating sequence to be AaBb... or why that sequence was picked for en_US instead of using the natural order of A-Za-z.

Regards,
Lee

--
Problem reports: http://cygwin.com/problems.html
FAQ: http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info: http://cygwin.com/ml/#unsubscribe-simple
Re: Problem with Bash regex test case sensitivity
On Dec 4 10:05, Lee wrote:
> On 12/3/10, Eric Blake eblake@ wrote:
>> Read the FAQ. http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.
>
> Which says the en_US locale collates the upper and lower case letters
> like this: AaBb...Zz
>
> I got that much :) What I don't get is why someone would _want_ the
> collating sequence to be AaBb... or why that sequence was picked for
> en_US instead of using the natural order of A-Za-z.

It's not the natural order, it's an arbitrary order which was chosen back in 1963 when the ASCII code was defined. It's not used as natural order outside of computer systems and it's not even the natural order on some computer systems (see EBCDIC).

If you take a look into a hardcopy encyclopedia written in English, you'll probably be very comfortable that the words are ordered lexicographically instead of in ASCII coding.

Needless to say that ordering criteria for non-English languages may contain more characters in the sequence, in German for instance

  AaäBb...Ooö...Ssß...Uuü...Zz

So, let's reiterate:

- If I need the order for the computer language, I say so:

    LC_COLLATE=C.UTF-8

- Otherwise, if I need the order for the natural language, I say so:

    LC_COLLATE=en_US.UTF-8
    LC_COLLATE=de_DE.UTF-8
    ...

Corinna

--
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
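A minimal sketch of the "computer language" ordering described above, assuming a POSIX sort is on the PATH. The helper name sort_bytewise is made up for illustration. Forcing LC_ALL=C for the one command makes sort compare by byte value, so every uppercase ASCII letter lands before every lowercase one, whatever LANG the calling shell uses.

```bash
#!/bin/sh
# sort_bytewise (hypothetical helper): sort stdin by byte value.
# LC_ALL=C applies to this one sort invocation only; the locale
# settings of the calling shell are left untouched.
sort_bytewise() {
    LC_ALL=C sort
}

printf '%s\n' b A a B | sort_bytewise
# prints A, B, a, b (one per line): uppercase sorts before lowercase
```

Running the same pipeline through a plain sort under a natural-language locale is where the AaBb... style of ordering would show up instead, provided that locale is installed.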
Re: Problem with Bash regex test case sensitivity
On Sat, Dec 04, 2010 at 10:05:42AM -0400, Lee wrote:
>> On 12/03/2010 07:11 PM, Lee wrote:
>
> why put the locale defaults in ~/.bashrc? My understanding is that
> ~/.bashrc is called at every shell startup. Seems like that's one of
> those things that just needs to be set in the login shell, so wouldn't
> ~/.bash_profile be more appropriate for the locale settings?

(Most probably you already know all of this, but...)

As of now, the default settings are provided via /etc/profile:

  if [ -d "/etc/profile.d" ]; then
    while read f; do
      if [ -f "${f}" ]; then
        . "${f}"
      fi
    done <<- EOF
  `/bin/find -L /etc/profile.d -type f -iname '*.sh' -or -iname '*.zsh' | LC_ALL=C sort`
  EOF
  fi

which in turn sources /etc/profile.d/lang.sh:

  # if no locale variable is set, indicate terminal charset via LANG
  test -z "${LC_ALL:-${LC_CTYPE:-$LANG}}" && export LANG=C.UTF-8

The bash manual page explains the order in which startup files are read for both login and non-login shells (both interactive and non-interactive).

So, given that ~/.bash_profile sources ~/.bashrc (in our Cygwin defaults), that looks like an easy way to set your LANG in a per-user manner, no matter what kind of shell you open. If you want it to be a system-wide setting, you should use /etc/bash.bashrc (for the bash shell, of course). Setting it only in ~/.bash_profile makes it invisible to non-login shells.

--
Primary key fingerprint: 0FDA C36F F110 54F4 D42B D0EB 617D 396C 448B 31EB
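Putting the advice from this thread together, a per-user setup might look like the sketch below. It is only an illustration, assuming the Cygwin default in which ~/.bash_profile sources ~/.bashrc; the fallback value C.UTF-8 is the one used by lang.sh, and you would substitute whatever LANG you actually want.

```bash
# ~/.bashrc fragment (a sketch, not the shipped Cygwin default)

# Keep LANG as the general default; fall back to C.UTF-8 if nothing set it.
export LANG="${LANG:-C.UTF-8}"

# Override collation only, so that ranges like [A-Z] follow byte order.
export LC_COLLATE=C

# LC_ALL would take precedence over LC_COLLATE, so make sure it is not set.
unset LC_ALL
```

Because only LC_COLLATE is pinned, the other categories (character types, messages and so on) still follow LANG.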
Re: Problem with Bash regex test case sensitivity
On 12/4/10, Corinna Vinschen corinna-cygwin wrote:
> On Dec 4 10:05, Lee wrote:
>> On 12/3/10, Eric Blake eblake@ wrote:
>>> Read the FAQ. http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.
>>
>> Which says the en_US locale collates the upper and lower case letters
>> like this: AaBb...Zz
>>
>> I got that much :) What I don't get is why someone would _want_ the
>> collating sequence to be AaBb... or why that sequence was picked for
>> en_US instead of using the natural order of A-Za-z.
>
> It's not the natural order, it's an arbitrary order which was chosen
> back in 1963 when the ASCII code was defined. It's not used as natural
> order outside of computer systems and it's not even the natural order
> on some computer systems (see EBCDIC).

My idea of natural order is treating each character as an unsigned integer. So even though ASCII has a different collating sequence than EBCDIC, the characters are still treated as unsigned integers when sorting them. Setting LANG to something other than C seems to break that model..

> If you take a look into a hardcopy encyclopedia written in English,
> you'll probably be very comfortable that the words are ordered
> lexicographically instead of in ASCII coding.

I never paid all that much attention to how the words were ordered, but now that I have.. they're backwards! god comes before God, hopper before Hopper, etc.

> Needless to say that ordering criteria for non-English languages may
> contain more characters in the sequence, in German for instance
>
>   AaäBb...Ooö...Ssß...Uuü...Zz
>
> So, let's reiterate:
>
> - If I need the order for the computer language, I say so:
>
>     LC_COLLATE=C.UTF-8
>
> - Otherwise, if I need the order for the natural language, I say so:
>
>     LC_COLLATE=en_US.UTF-8
>     LC_COLLATE=de_DE.UTF-8

You're quite good at explaining this.. I think I'm actually beginning to understand it :)

So... the reason for setting LANG is a shorthand method of setting all the LC_xxx environment variables?

Thanks,
Lee
Re: Problem with Bash regex test case sensitivity
On 12/4/2010 10:06 AM, Corinna Vinschen wrote:
> On Dec 4 10:05, Lee wrote:
>> On 12/3/10, Eric Blake eblake@ wrote:
>>> Read the FAQ. http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.
>>
>> Which says the en_US locale collates the upper and lower case letters
>> like this: AaBb...Zz
>>
>> I got that much :) What I don't get is why someone would _want_ the
>> collating sequence to be AaBb... or why that sequence was picked for
>> en_US instead of using the natural order of A-Za-z.
>
> It's not the natural order, it's an arbitrary order which was chosen
> back in 1963 when the ASCII code was defined. It's not used as natural
> order outside of computer systems and it's not even the natural order
> on some computer systems (see EBCDIC).
>
> If you take a look into a hardcopy encyclopedia written in English,
> you'll probably be very comfortable that the words are ordered
> lexicographically instead of in ASCII coding.
>
> Needless to say that ordering criteria for non-English languages may
> contain more characters in the sequence, in German for instance
>
>   AaäBb...Ooö...Ssß...Uuü...Zz
>
> So, let's reiterate:
>
> - If I need the order for the computer language, I say so:
>
>     LC_COLLATE=C.UTF-8
>
> - Otherwise, if I need the order for the natural language, I say so:
>
>     LC_COLLATE=en_US.UTF-8
>     LC_COLLATE=de_DE.UTF-8
>     ...

Here's my takeaway, given Corinna's interesting and complete context, and my intents. (My intentions, BTW, are for my scripts to have as much generality as possible [given my limited skills ;-|].)

Therefore, instead of using '[A-Z]' to represent caps, I should have used (?) the Posixly Correct, '[:upper:]'.

However, the test script (attached) still doesn't work on either my Cygwin config, or a Linux config, with this change. (I have not yet made the above indicated environment variable changes, since I am still waiting for clarification to the new issue I bring up, here.)

The latter test would, IMHO, seem to imply that the changes to 'NIX shells were mandated by I18N considerations, BUT the other required changes in code or default settings were NOT implemented. This would seem to penalize only those folks who are conversant with the long-term conventions of the 'NIX world.

Please correct my misunderstanding if I'm wrong!

Lee

[Attached script:]

  #!/bin/bash
  # t_regex: Tutorial on regex, test
  # By Lee Rothstein, 2010-12-04, 13:57:54
  # Each Test performed on:
  # * CYGWIN_NT-6.0-WOW64 1.7.7(0.230/5/3)
  # * Linux 2.6.15-55-amd64-generic

  #if [[ $1 =~ [A-Z] ]] ; then # doesn't work
  if [[ $1 =~ [:upper:] ]] ; then # doesn't work
  #if [[ $1 =~ [ABCDEFGHIJKLMNOPQRSTUVWXYZ] ]] ; then # Works
    echo Contains Capital Letters: $1
  else
    echo Doesn\'t Contain Capital Letters: $1
  fi
Re: Problem with Bash regex test case sensitivity
On 4 December 2010 21:08, Lee wrote:
> So... the reason for setting LANG is a shorthand method of setting all
> the LC_xxx environment variables?

Yes. Setting LC_ALL does that too, but the difference between LC_ALL and LANG is that LC_ALL takes precedence over the specific LC_xxx variables, whereas LANG does not. Hence, LANG allows you to set all locale categories while still allowing specific ones such as LC_COLLATE to be overridden.

(Perhaps things would be a bit clearer if LANG had been called LC_DEFAULT or some such.)

Andy
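The precedence described above can be written out as a one-line lookup. This is a sketch of the rule, not of how any C library implements it; effective_collate is a made-up name, and the final fallback to C is an assumption (when none of the variables is set, the default is up to the implementation).

```bash
#!/bin/sh
# effective_collate (hypothetical helper)
# Prints the locale name that applies to collation:
#   1. LC_ALL, if set and non-empty, wins over everything;
#   2. otherwise LC_COLLATE, if set and non-empty;
#   3. otherwise LANG;
#   4. otherwise "C" (assumed fallback).
effective_collate() {
    printf '%s\n' "${LC_ALL:-${LC_COLLATE:-${LANG:-C}}}"
}

# LANG sets the default, LC_COLLATE overrides it for collation only.
# The subshell keeps these assignments away from the calling shell.
( unset LC_ALL; LANG=en_US.UTF-8; LC_COLLATE=C; effective_collate )   # prints C
```

If en_US.UTF-8 is not installed, bash may print a setlocale warning on stderr for the example line; the printed result is the same either way, because the helper only inspects the variables.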
Re: Problem with Bash regex test case sensitivity
On 12/4/10, Lee Rothstein lee@ wrote:
> On 12/4/2010 10:06 AM, Corinna Vinschen wrote:
>> On Dec 4 10:05, Lee wrote:
>>> On 12/3/10, Eric Blake eblake@ wrote:
>>>> Read the FAQ. http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.
>>>
>>> Which says the en_US locale collates the upper and lower case letters
>>> like this: AaBb...Zz
>>>
>>> I got that much :) What I don't get is why someone would _want_ the
>>> collating sequence to be AaBb... or why that sequence was picked for
>>> en_US instead of using the natural order of A-Za-z.
>>
>> It's not the natural order, it's an arbitrary order which was chosen
>> back in 1963 when the ASCII code was defined. It's not used as natural
>> order outside of computer systems and it's not even the natural order
>> on some computer systems (see EBCDIC).
>>
>> If you take a look into a hardcopy encyclopedia written in English,
>> you'll probably be very comfortable that the words are ordered
>> lexicographically instead of in ASCII coding.
>>
>> Needless to say that ordering criteria for non-English languages may
>> contain more characters in the sequence, in German for instance
>>
>>   AaäBb...Ooö...Ssß...Uuü...Zz
>>
>> So, let's reiterate:
>>
>> - If I need the order for the computer language, I say so:
>>
>>     LC_COLLATE=C.UTF-8
>>
>> - Otherwise, if I need the order for the natural language, I say so:
>>
>>     LC_COLLATE=en_US.UTF-8
>>     LC_COLLATE=de_DE.UTF-8
>>     ...
>
> Here's my takeaway, given Corinna's interesting and complete context,
> and my intents. (My intentions, BTW, are for my scripts to have as much
> generality as possible [given my limited skills ;-|].)
>
> Therefore, instead of using '[A-Z]' to represent caps, I should have
> used (?) the Posixly Correct, '[:upper:]'.

Close, you should have used '[[:upper:]]'

  $ cat t_regex
  #!/bin/bash
  # t_regex: Test test regex
  # By Lee Rothstein, 2010-12-03, 16:27:38

  regex_test () {
    echo -n "[A-Z] test: "
    if [[ $1 =~ [A-Z] ]] ; then
      echo Contains Capital Letters: $1
    else
      echo Doesn\'t Contain Capital Letters: $1
    fi

    echo -n "[:upper:] test: "
    if [[ $1 =~ [[:upper:]] ]] ; then
      echo Contains Capital Letters: $1
    else
      echo Doesn\'t Contain Capital Letters: $1
    fi
  }

  unset LC_COLLATE

  export LANG=C.UTF-8
  echo === LANG=$LANG
  regex_test dfgh
  regex_test Dfgh
  echo
  echo
  export LANG=en_US.UTF-8
  echo === LANG=$LANG
  regex_test dfgh
  regex_test Dfgh

  ~/src
  $ ./t_regex
  === LANG=C.UTF-8
  [A-Z] test: Doesn't Contain Capital Letters: dfgh
  [:upper:] test: Doesn't Contain Capital Letters: dfgh
  [A-Z] test: Contains Capital Letters: Dfgh
  [:upper:] test: Contains Capital Letters: Dfgh


  === LANG=en_US.UTF-8
  [A-Z] test: Contains Capital Letters: dfgh
  [:upper:] test: Doesn't Contain Capital Letters: dfgh
  [A-Z] test: Contains Capital Letters: Dfgh
  [:upper:] test: Contains Capital Letters: Dfgh

  ~/src
  $

Lee
Re: Problem with Bash regex test case sensitivity
On 12/04/2010 02:49 PM, Lee Rothstein wrote:
> Therefore, instead of using '[A-Z]' to represent caps, I should have
> used (?) the Posixly Correct, '[:upper:]'.

POSIX 2001 and 2008 say that [A-Z], when used as a glob or as a regex, is defined _only_ in the C locale; in all other locales, its behavior is unspecified. Meanwhile, [[:upper:]] (note the double [ and ]) is well-defined in all locales (the next version of POSIX will make it more obvious that [:upper:] might be treated as either [:epru] or [[:upper:]], if not outright rejected as an error).

It all stems from the earlier POSIX 1992, which required [A-Z] to match collation order in all locales. POSIX 2001 withdrew that requirement based on how many people it confused, but the damage was already done - [A-Z] is no longer portable because of implementations that literally followed the older requirement (bash included).

Now, what would be really nice is if all implementations treated unspecified behavior for [A-Z] as meaning a sane synonym for [[:upper:]], but that's not going to happen any time soon.

--
Eric Blake   ebl...@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
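As a minimal sketch of the locale-independent form, the check could be wrapped like this. The function name has_upper is made up for illustration, and the example assumes bash with the nocasematch option left off.

```bash
#!/usr/bin/env bash
# has_upper STRING (hypothetical helper)
# Succeeds if STRING contains at least one uppercase letter.
# [[:upper:]] is a character class inside a bracket expression, so the
# result does not depend on how the locale collates the range A-Z.
has_upper() {
    local re='[[:upper:]]'   # regex kept in a variable and used unquoted
    [[ $1 =~ $re ]]
}

for word in dfgh Dfgh; do
    if has_upper "$word"; then
        echo "Contains capital letters: $word"
    else
        echo "Doesn't contain capital letters: $word"
    fi
done
```

Keeping the pattern in a variable sidesteps the shell's quoting rules for the right-hand side of =~, where a quoted pattern is matched as a literal string.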
Re: Problem with Bash regex test case sensitivity
On 12/4/2010 5:34 PM, Lee wrote:
> On 12/4/10, Lee Rothstein wrote:
>> On 12/4/2010 10:06 AM, Corinna Vinschen wrote:
>>> On Dec 4 10:05, Lee wrote:
>>>> On 12/3/10, Eric Blake wrote:
>>
>> Here's my takeaway, given Corinna's interesting and complete context,
>> and my intents. (My intentions, BTW, are for my scripts to have as much
>> generality as possible [given my limited skills ;-|].)
>>
>> Therefore, instead of using '[A-Z]' to represent caps, I should have
>> used (?) the Posixly Correct, '[:upper:]'.
>
> Close, you should have used '[[:upper:]]'

That works! Sorry to be so obtuse.

Thanks, (or should I say Nevermind -- http://www.youtube.com/v/V3FnpaWQJO0?fs=1&hl=en_US)

emiLee [Litella] ;-)
(http://www.hulu.com/watch/4092/saturday-night-live-update-2-emily-litella)
(RIP, Gilda Radner)
Re: Problem with Bash regex test case sensitivity
On 2010-12-03 22:30Z, Lee Rothstein wrote:
> [script:]
> if [[ $1 =~ [A-Z] ]] ; then
>   echo Contains Capital Letters: $1
> else
>   echo Doesn\'t Contain Capital Letters: $1
> fi
[...]
> # WTF, O
> $ t_regex dfgh
> Contains Capital Letters: dfgh

Inspect this option:

  shopt -p | grep nocasematch

Perhaps you have it set in your startup files? Example of different 'nocasematch' settings with the same command:

  $ shopt -u nocasematch
  $ if [[ a =~ [A-Z] ]] ; then echo UPPER; else echo lower; fi
  lower
  $ shopt -s nocasematch
  $ if [[ a =~ [A-Z] ]] ; then echo UPPER; else echo lower; fi
  UPPER
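If a script has to work no matter what the startup files did to nocasematch, one option is to switch it off around the match and restore it afterwards. The following is a sketch under that assumption; match_exact_case is a hypothetical helper name, while shopt -q, shopt -u and shopt -s are the ordinary bash builtin forms.

```bash
#!/usr/bin/env bash
# match_exact_case STRING REGEX (hypothetical helper)
# Runs a case-sensitive regex match even if the caller has nocasematch
# enabled, then puts the option back the way it was found.
match_exact_case() {
    local string=$1 regex=$2 was_set=0 status
    shopt -q nocasematch && was_set=1      # remember the caller's setting
    shopt -u nocasematch                   # force case-sensitive matching
    [[ $string =~ $regex ]] && status=0 || status=$?
    (( was_set )) && shopt -s nocasematch  # restore it if it was on
    return "$status"
}

shopt -s nocasematch    # simulate a startup file that turned the option on
if match_exact_case a '[[:upper:]]'; then echo UPPER; else echo lower; fi
# prints: lower
```

The example uses [[:upper:]] rather than [A-Z] so that the result does not also depend on the collation order of the current locale.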
Re: Problem with Bash regex test case sensitivity
On 12/3/10, Lee Rothstein wrote:
> Having some problems with bash case-sensitive regexes, so I wrote this
> little test.
... snip ...
> Do I have some Bash or Cygwin parameter set that engenders case
> insensitivity?

Probably the same thing I ran into with LANG != C. Try this little test:

  $ cat t_regex
  #!/bin/bash
  # t_regex: Test test regex
  # By Lee Rothstein, 2010-12-03, 16:27:38

  regex_test () {
    if [[ $1 =~ [A-Z] ]] ; then
      echo Contains Capital Letters: $1
    else
      echo Doesn\'t Contain Capital Letters: $1
    fi
  }

  export LANG=C.UTF-8
  regex_test dfgh
  export LANG=en_US.UTF-8
  regex_test dfgh

  ~/src
  $ ./t_regex
  Doesn't Contain Capital Letters: dfgh
  Contains Capital Letters: dfgh

Or, is this a bug?

Welcome to the new world order :-0 I tried to figure out why the collating sequence changes with the language settings but didn't get anywhere beyond the fact that it _does_ change. Oh well.. try, try again.

Regards,
Lee
Re: Problem with Bash regex test case sensitivity
On 12/03/2010 07:11 PM, Lee wrote:
> Or, is this a bug?

No, but a feature of your locale. Set 'export LC_COLLATE=C', and use LANG rather than LC_ALL for all your other locale defaults, in your ~/.bashrc if you don't like it.

> Welcome to the new world order :-0 I tried to figure out why the
> collating sequence changes with the language settings but didn't get
> anywhere beyond the fact that it _does_ change. Oh well.. try, try again.

Read the FAQ. http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.

--
Eric Blake   ebl...@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org