Re: Problem with Bash regex test case sensitivity

2010-12-04 Thread Lee
On 12/3/10, Eric Blake eblake@  wrote:
> On 12/03/2010 07:11 PM, Lee wrote:
>> Or, is this a bug?
>
> No, but a feature of your locale.  Set 'export LC_COLLATE=C', and use
> LANG rather than LC_ALL for all your other locale defaults, in your
> ~/.bashrc if you don't like it.

Nice tip - thank you.  But is there a reason I'd want LANG set to
en_US.UTF-8 instead of C.UTF-8?  As far as I can tell, everything
works for me with LANG=C.UTF-8.  Other than changing the collating
sequence to something I don't want, what does LANG=en_US.UTF-8 get me
that LANG=C.UTF-8 doesn't?

And as long as I'm showing how ignorant I am...  why put the locale
defaults in ~/.bashrc?  My understanding is that ~/.bashrc is read
at every shell startup.  Seems like that's one of those things that
just needs to be set in the login shell, so wouldn't ~/.bash_profile
be more appropriate for the locale settings?

>> Welcome to the new world order :-0   I tried to figure out why the
>> collating sequence changes with the language settings but didn't get
>> anywhere beyond the fact that it _does_ change.  Oh well.. try, try
>> again.
>
> Read the FAQ.  http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.

Which says the en_US locale collates the upper and lower case letters like this:
AaBb...Zz

I got that much :)  What I don't get is why someone would _want_ the
collating sequence to be AaBb... or why that sequence was picked for
en_US instead of using the natural order of A-Za-z.
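
One way to see why a lowercase letter can land inside A-Z is to model the AaBb...Zz order directly. The sketch below is only a simulation of that ordering in plain bash; it does not query any real locale, and the names `order`, `position` and `in_range` are made up for illustration.

```shell
#!/bin/bash
# Build the simulated collation sequence AaBb...Zz described in the FAQ entry.
upper=ABCDEFGHIJKLMNOPQRSTUVWXYZ
lower=abcdefghijklmnopqrstuvwxyz
order=
for ((i = 0; i < 26; i++)); do
  order+=${upper:i:1}${lower:i:1}
done

# position CHAR: print the index of CHAR in the simulated order, or fail.
position () {
  [[ ${#1} -eq 1 ]] || return 1
  local prefix=${order%%"$1"*}
  [[ $prefix != "$order" ]] || return 1
  echo "${#prefix}"
}

# in_range CHAR: succeed if CHAR sorts between A and Z in the simulated order.
in_range () {
  local p
  p=$(position "$1") || return 1
  (( p >= $(position A) && p <= $(position Z) ))
}

for ch in d D z; do
  if in_range "$ch"; then
    echo "$ch: inside A-Z"
  else
    echo "$ch: outside A-Z"
  fi
done
```

In this model only lowercase z falls outside the range, because it is the single letter that sorts after Z.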

Regards,
Lee

--
Problem reports:   http://cygwin.com/problems.html
FAQ:   http://cygwin.com/faq/
Documentation: http://cygwin.com/docs.html
Unsubscribe info:  http://cygwin.com/ml/#unsubscribe-simple



Re: Problem with Bash regex test case sensitivity

2010-12-04 Thread Corinna Vinschen
On Dec  4 10:05, Lee wrote:
> On 12/3/10, Eric Blake eblake@  wrote:
> > Read the FAQ.  http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.
>
> Which says the en_US locale collates the upper and lower case letters
> like this:
>   AaBb...Zz
>
> I got that much :)  What I don't get is why someone would _want_ the
> collating sequence to be AaBb... or why that sequence was picked for
> en_US instead of using the natural order of A-Za-z.

It's not the natural order, it's an arbitrary order which was chosen
back in 1963 when the ASCII code was defined.  It's not used as a
natural order outside of computer systems, and it's not even the
natural order on some computer systems (see EBCDIC).

If you take a look at a hardcopy encyclopedia written in English,
you'll probably be quite comfortable with the words being ordered
lexicographically rather than by ASCII code.  Needless to say, the
ordering criteria for non-English languages may contain more characters
in the sequence, in German for instance

  AaäBb...Ooö...Ssß...Uuü...Zz

So, let's reiterate:

- If I need the order for the computer language, I say so:

   LC_COLLATE=C.UTF-8

- Otherwise, if I need the order for the natural language, I say so:

   LC_COLLATE=en_US.UTF-8
   LC_COLLATE=de_DE.UTF-8
   ...
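
A rough sketch of the difference, using sort rather than a regex. The helper name `sort_with_collate` and the word list are invented for illustration, and the second run assumes that en_US.UTF-8 is actually installed; if it is not, sort will most likely fall back to the C order.

```shell
#!/bin/bash
# sort_with_collate LOCALE: sort stdin using LOCALE for collation only.
# LC_ALL is unset in a subshell because it would override LC_COLLATE.
sort_with_collate () {
  ( unset LC_ALL; LC_COLLATE="$1" sort )
}

words='banana
Apple
apple
Banana'

# Computer-language order: all uppercase letters sort before lowercase.
echo "--- LC_COLLATE=C"
printf '%s\n' "$words" | sort_with_collate C

# Natural-language order (assumption: this locale exists on the system).
echo "--- LC_COLLATE=en_US.UTF-8"
printf '%s\n' "$words" | sort_with_collate en_US.UTF-8
```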


Corinna

-- 
Corinna Vinschen  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader  cygwin AT cygwin DOT com
Red Hat




Re: Problem with Bash regex test case sensitivity

2010-12-04 Thread David Sastre
On Sat, Dec 04, 2010 at 10:05:42AM -0400, Lee wrote:
> > On 12/03/2010 07:11 PM, Lee wrote:
>
> why put the locale
> defaults in ~/.bashrc?  My understanding is that ~/.bashrc is read
> at every shell startup.  Seems like that's one of those things that
> just needs to be set in the login shell, so wouldn't ~/.bash_profile
> be more appropriate for the locale settings?

(Most probably you already know all of this, but...)
As of now, the default settings are provided via /etc/profile:

if [ -d /etc/profile.d ]; then
  while read f; do
    if [ -f "${f}" ]; then
      . "${f}"
    fi
  done <<- EOF
`/bin/find -L /etc/profile.d -type f -iname '*.sh' -or -iname '*.zsh' | LC_ALL=C sort`
EOF
fi

which in turn sources /etc/profile.d/lang.sh:

# if no locale variable is set, indicate terminal charset via LANG
test -z "${LC_ALL:-${LC_CTYPE:-$LANG}}" && export LANG=C.UTF-8

The bash manual page explains the order in which startup files are
read for login and non-login shells (both interactive and
non-interactive). Given that ~/.bash_profile sources ~/.bashrc (in
our Cygwin defaults), ~/.bashrc looks like an easy way to set your LANG
on a per-user basis, no matter what kind of shell you open.
If you want it to be a system-wide setting, you should use
/etc/bash.bashrc (for the bash shell, of course).
Setting it only in ~/.bash_profile makes it invisible to non-login
shells.
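
As a minimal sketch of the per-user approach, a ~/.bashrc fragment along these lines would combine it with the earlier advice in this thread (LC_COLLATE=C, and LANG rather than LC_ALL). The C.UTF-8 fallback mirrors lang.sh; the rest is an assumption about what a user might want, not a Cygwin default.

```shell
# Hypothetical ~/.bashrc fragment (illustration only).

# Keep an inherited LANG if there is one, otherwise use the same
# fallback that /etc/profile.d/lang.sh applies.
export LANG="${LANG:-C.UTF-8}"

# Force byte-value collation so that ranges such as [A-Z] behave as in
# the C locale, whatever LANG says about the other categories.
export LC_COLLATE=C

# LC_ALL would override LC_COLLATE, so make sure it is not set.
unset LC_ALL

# ~/.bash_profile would then pick this up for login shells with:
#   if [ -f "${HOME}/.bashrc" ]; then . "${HOME}/.bashrc"; fi
```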

-- 
Huella de clave primaria: 0FDA C36F F110 54F4 D42B  D0EB 617D 396C 448B 31EB




Re: Problem with Bash regex test case sensitivity

2010-12-04 Thread Lee
On 12/4/10, Corinna Vinschen corinna-cygwin  wrote:
> On Dec  4 10:05, Lee wrote:
>> On 12/3/10, Eric Blake eblake@  wrote:
>>> Read the FAQ.  http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.
>>
>> Which says the en_US locale collates the upper and lower case letters like
>> this:
>>   AaBb...Zz
>>
>> I got that much :)  What I don't get is why someone would _want_ the
>> collating sequence to be AaBb... or why that sequence was picked for
>> en_US instead of using the natural order of A-Za-z.
>
> It's not the natural order, it's an arbitrary order which has been
> chosen back in 1963 when the ASCII code has been defined.  It's not used
> as natural order outside of computer systems and it's not even the
> natural order on some computer systems (See EBCDIC).

My idea of natural order is treating each character as an unsigned
integer.  So even though ASCII has a different collating sequence than
EBCDIC, the characters are still treated as unsigned integers when
sorting them.  Setting LANG to something other than C seems to break
that model..
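
The "unsigned integer" view can be checked from the shell itself. This is a small sketch for ASCII characters only; the function name `codepoint` is invented here, and it relies on the printf convention that an argument starting with a quote character is converted to the numeric value of the character after it.

```shell
#!/bin/bash
# codepoint CHAR: print the numeric value of a single ASCII character.
codepoint () {
  printf '%d\n' "'$1"
}

# In ASCII the uppercase block (65-90) sits entirely before the
# lowercase block (97-122), which is the A-Za-z order of the C locale.
for ch in A Z a z; do
  printf '%s = %s\n' "$ch" "$(codepoint "$ch")"
done
```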

> If you take a look into a hardcopy encyclopedia written in english,
> you'll be very comfortable that the words are ordered lexicographically
> instead of in ASCII coding, probably.

I never paid all that much attention to how the words were ordered,
but now that I have.. they're backwards!   god comes before God,
hopper before Hopper, etc.

>  Needless to say that ordering
> criteria for non-english languages may contain more characters in the
> sequence, in german for instance
>
>   AaäBb...Ooö...Ssß...Uuü...Zz
>
> So, let's reiterate:
>
> - If I need the order for the computer language, I say so:
>
>    LC_COLLATE=C.UTF-8
>
> - Otherwise, if I need the order for the natural language, I say so:
>
>    LC_COLLATE=en_US.UTF-8
>    LC_COLLATE=de_DE.UTF-8

You're quite good at explaining this.. I think I'm actually beginning
to understand it :)
So...  the reason for setting LANG is a shorthand method of setting
all the LC_xxx environment variables?

Thanks,
Lee




Re: Problem with Bash regex test case sensitivity

2010-12-04 Thread Lee Rothstein

On 12/4/2010 10:06 AM, Corinna Vinschen wrote:

> On Dec  4 10:05, Lee wrote:
>> On 12/3/10, Eric Blake eblake@  wrote:
>>> Read the FAQ.  http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.
>>
>> Which says the en_US locale collates the upper and lower case
>> letters like this:
>> AaBb...Zz
>>
>> I got that much :)  What I don't get is why someone would _want_ the
>> collating sequence to be AaBb... or why that sequence was picked for
>> en_US instead of using the natural order of A-Za-z.
>
> It's not the natural order, it's an arbitrary order which has been
> chosen back in 1963 when the ASCII code has been defined.  It's not used
> as natural order outside of computer systems and it's not even the
> natural order on some computer systems (See EBCDIC).
>
> If you take a look into a hardcopy encyclopedia written in english,
> you'll be very comfortable that the words are ordered lexicographically
> instead of in ASCII coding, probably.  Needless to say that ordering
> criteria for non-english languages may contain more characters in the
> sequence, in german for instance
>
>   AaäBb...Ooö...Ssß...Uuü...Zz
>
> So, let's reiterate:
>
> - If I need the order for the computer language, I say so:
>
>    LC_COLLATE=C.UTF-8
>
> - Otherwise, if I need the order for the natural language, I
>   say so:
>
>    LC_COLLATE=en_US.UTF-8
>    LC_COLLATE=de_DE.UTF-8
>    ...

Here's my takeaway, given Corinna's interesting and complete
context, and my intents. (My intentions, BTW, are for my scripts
to have as much generality as possible [given my limited skills
;-|].)

Therefore, instead of using '[A-Z]' to represent caps, I should
have used (?) the Posixly Correct, '[:upper:]'.

However, the test script (attached) still doesn't work on either
my Cygwin config, or a Linux config, with this change. (I have
not yet made the above indicated environment variable changes,
since I am still waiting for clarification to the new issue I
bring up, here.)

The latter test would, IMHO, seem to imply that the changes to
'NIX shells were mandated by I18N considerations, BUT the other
required changes in code or default settings were NOT implemented.

This would seem to penalize only those folks who are conversant
with the long-term conventions of the 'NIX world.

Please correct my misunderstanding if I'm wrong!

Lee

#!/bin/bash

# t_regex: Tutorial on regex, test

# By Lee Rothstein, 2010-12-04, 13:57:54

# Each Test performed on:

# * CYGWIN_NT-6.0-WOW64 1.7.7(0.230/5/3)
# * Linux 2.6.15-55-amd64-generic

#if [[ $1 =~ [A-Z] ]] ; then   # doesn't work
if [[ $1 =~ [:upper:] ]] ; then   # doesn't work
#if [[ $1 =~ [ABCDEFGHIJKLMNOPQRSTUVWXYZ] ]] ; then  # Works
  echo Contains Capital Letters: $1
else
  echo Doesn\'t Contain Capital Letters: $1
fi

Re: Problem with Bash regex test case sensitivity

2010-12-04 Thread Andy Koppe
On 4 December 2010 21:08, Lee wrote:
> So...  the reason for setting LANG is a shorthand method of setting
> all the LC_xxx environment variables?

Yes. Setting LC_ALL does that too, but the difference between LC_ALL
and LANG is that LC_ALL takes precedence over the specific LC_xxx
variables, whereas LANG does not. Hence, LANG allows you to set all
locale categories while still allowing specific ones such as
LC_COLLATE to be overridden. (Perhaps things would be a bit clearer if
LANG had been called LC_DEFAULT or some such.)
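
The precedence described above can be written down as a small lookup. This is only a sketch of the rule (LC_ALL, then the specific LC_xxx variable, then LANG); the function name `effective_locale` is made up, and printing POSIX when nothing is set is a simplification of the implementation-defined default.

```shell
#!/bin/bash
# effective_locale CATEGORY: print the value that governs CATEGORY
# (for example LC_COLLATE). Empty variables count as unset.
effective_locale () {
  local category=$1
  if [[ -n ${LC_ALL:-} ]]; then
    echo "$LC_ALL"            # LC_ALL beats everything
  elif [[ -n ${!category:-} ]]; then
    echo "${!category}"       # then the specific category variable
  elif [[ -n ${LANG:-} ]]; then
    echo "$LANG"              # LANG is only the default
  else
    echo POSIX                # simplification of the built-in default
  fi
}

(
  unset LC_ALL LC_CTYPE
  LANG=en_US.UTF-8
  LC_COLLATE=C
  effective_locale LC_COLLATE   # the specific variable wins over LANG
  effective_locale LC_CTYPE     # nothing specific set, so LANG applies
)
```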

Andy




Re: Problem with Bash regex test case sensitivity

2010-12-04 Thread Lee
On 12/4/10, Lee Rothstein lee@  wrote:
> Here's my takeaway, given Corinna's interesting and complete
> context, and my intents. (My intentions, BTW, are for my scripts
> to have as much generality as possible [given my limited skills
> ;-|].)
>
> Therefore, instead of using '[A-Z]' to represent caps, I should
> have used (?) the Posixly Correct, '[:upper:]'.

Close, you should have used '[[:upper:]]'

$ cat t_regex
#!/bin/bash
# t_regex: Test test regex
# By Lee Rothstein, 2010-12-03, 16:27:38

regex_test () {

echo -n [A-Z] test: 
if [[ $1 =~ [A-Z] ]] ; then
   echo Contains Capital Letters: $1
else
   echo Doesn\'t Contain Capital Letters: $1
fi

echo -n [:upper:] test: 
if [[ $1 =~ [[:upper:]] ]] ; then
   echo Contains Capital Letters: $1
else
   echo Doesn\'t Contain Capital Letters: $1
fi

}

unset LC_COLLATE
export LANG=C.UTF-8
echo === LANG=$LANG
regex_test dfgh
regex_test Dfgh

echo
echo

export LANG=en_US.UTF-8
echo === LANG=$LANG
regex_test dfgh
regex_test Dfgh


 ~/src
$ ./t_regex
=== LANG=C.UTF-8
[A-Z] test: Doesn't Contain Capital Letters: dfgh
[:upper:] test: Doesn't Contain Capital Letters: dfgh
[A-Z] test: Contains Capital Letters: Dfgh
[:upper:] test: Contains Capital Letters: Dfgh


=== LANG=en_US.UTF-8
[A-Z] test: Contains Capital Letters: dfgh
[:upper:] test: Doesn't Contain Capital Letters: dfgh
[A-Z] test: Contains Capital Letters: Dfgh
[:upper:] test: Contains Capital Letters: Dfgh

 ~/src
$


Lee




Re: Problem with Bash regex test case sensitivity

2010-12-04 Thread Eric Blake
On 12/04/2010 02:49 PM, Lee Rothstein wrote:
> Therefore, instead of using '[A-Z]' to represent caps, I should
> have used (?) the Posixly Correct, '[:upper:]'.

POSIX 2001 and 2008 say that [A-Z], when used as a glob or as a regex, is
defined _only_ in the C locale; in all other locales, its behavior is
unspecified.  Meanwhile, [[:upper:]] (note the double [ and ]) is
well-defined in all locales (the next version of POSIX will make it more
obvious that [:upper:] might be treated as either [:epru] or
[[:upper:]], if not outright rejected as an error).  It all stems from
the earlier POSIX 1992, which required [A-Z] to match collation order in
all locales.  POSIX 2001 withdrew that requirement based on how many
people it confused, but the damage was already done: [A-Z] is no longer
portable because of implementations that literally followed the older
requirement (bash included).

Now, what would be really nice is if all implementations treated
unspecified behavior for [A-Z] as meaning a sane synonym for
[[:upper:]], but that's not going to happen any time soon.
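
For scripts that just need a locale-independent "contains a capital letter" check, something like the following sketch should do. The helper name `has_upper` is invented here; keeping the pattern in a variable is a stylistic choice rather than a requirement.

```shell
#!/bin/bash
# has_upper STRING: succeed if STRING contains at least one uppercase
# letter according to the current locale's [:upper:] class.
has_upper () {
  local re='[[:upper:]]'   # note the double brackets
  [[ $1 =~ $re ]]
}

for word in dfgh Dfgh; do
  if has_upper "$word"; then
    echo "Contains Capital Letters: $word"
  else
    echo "Doesn't Contain Capital Letters: $word"
  fi
done
```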

-- 
Eric Blake   ebl...@redhat.com+1-801-349-2682
Libvirt virtualization library http://libvirt.org





Re: Problem with Bash regex test case sensitivity

2010-12-04 Thread Lee Rothstein

On 12/4/2010 5:34 PM, Lee wrote:

> On 12/4/10, Lee Rothstein wrote:
>> Here's my takeaway, given Corinna's interesting and complete
>> context, and my intents. (My intentions, BTW, are for my scripts
>> to have as much generality as possible [given my limited skills
>> ;-|].)
>>
>> Therefore, instead of using '[A-Z]' to represent caps, I should
>> have used (?) the Posixly Correct, '[:upper:]'.
>
> Close, you should have used '[[:upper:]]'


That works! Sorry to be so obtuse.

Thanks, (or should I say Nevermind -- 
http://www.youtube.com/v/V3FnpaWQJO0?fs=1&hl=en_US)


emiLee [Litella] ;-) 
(http://www.hulu.com/watch/4092/saturday-night-live-update-2-emily-litella)


(RIP, Gilda Radner)




Re: Problem with Bash regex test case sensitivity

2010-12-03 Thread Greg Chicares
On 2010-12-03 22:30Z, Lee Rothstein wrote:
[script:]
> if [[ $1 =~ [A-Z] ]] ; then
>  echo Contains Capital Letters: $1
> else
>  echo Doesn\'t Contain Capital Letters: $1
> fi
[...]
> # WTF, O
> $ t_regex dfgh
> Contains Capital Letters: dfgh

Inspect this option:
  shopt -p | grep nocasematch
Perhaps you have it set in your startup files?

Example of different 'nocasematch' settings with the same command:

$ shopt -u nocasematch
$ if [[ a =~ [A-Z] ]] ; then echo UPPER; else echo lower; fi
lower

$ shopt -s nocasematch
$ if [[ a =~ [A-Z] ]] ; then echo UPPER; else echo lower; fi
UPPER




Re: Problem with Bash regex test case sensitivity

2010-12-03 Thread Lee
On 12/3/10, Lee Rothstein   wrote:
> Having some problems with bash case-sensitive regexes, so I wrote
> this little test.
>   ... snip ...
> Do I have some Bash or Cygwin parameter set that engenders case
> insensitivity?

Probably the same thing I ran into with LANG != C
try this little test:

$ cat t_regex
#!/bin/bash
# t_regex: Test test regex
# By Lee Rothstein, 2010-12-03, 16:27:38

regex_test () {
if [[ $1 =~ [A-Z] ]] ; then
   echo Contains Capital Letters: $1
else
   echo Doesn\'t Contain Capital Letters: $1
fi
}

export LANG=C.UTF-8
regex_test dfgh

export LANG=en_US.UTF-8
regex_test dfgh


 ~/src
$ ./t_regex
Doesn't Contain Capital Letters: dfgh
Contains Capital Letters: dfgh


> Or, is this a bug?

Welcome to the new world order :-0   I tried to figure out why the
collating sequence changes with the language settings but didn't get
anywhere beyond the fact that it _does_ change.  Oh well.. try, try
again.

Regards,
Lee




Re: Problem with Bash regex test case sensitivity

2010-12-03 Thread Eric Blake
On 12/03/2010 07:11 PM, Lee wrote:
> Or, is this a bug?

No, but a feature of your locale.  Set 'export LC_COLLATE=C', and use
LANG rather than LC_ALL for all your other locale defaults, in your
~/.bashrc if you don't like it.

 
> Welcome to the new world order :-0   I tried to figure out why the
> collating sequence changes with the language settings but didn't get
> anywhere beyond the fact that it _does_ change.  Oh well.. try, try
> again.

Read the FAQ.  http://www.faqs.org/faqs/unix-faq/shell/bash/, E9.

-- 
Eric Blake   ebl...@redhat.com+1-801-349-2682
Libvirt virtualization library http://libvirt.org


