Bug#577095: grep: bracket expressions fails depending on the locale

2010-04-10 Thread Aníbal Monsalve Salazar
On Sat, Apr 10, 2010 at 08:46:06AM +0200, Sven Joachim wrote:
On 2010-04-10 08:06 +0200, Aníbal Monsalve Salazar wrote:

Version: 2.6.3-1

On Sat, Apr 10, 2010 at 01:54:51PM +0900, Norihiro Tanaka wrote:
Hi,

It seems that this is expected behavior. [A-Z] includes A,b,B,c,C,...,y,Y,z,Z
in the en_US locale (but not `a').

According to the NEWS file, this should have been the case since grep
2.5.  I wonder if some of the now removed Debian patches prohibited
that behavior; at least, there is no indication that it had been turned
off deliberately in Debian's versions.

I'll check the removed patches in 2.5.4-4 to find out if one of them
was responsible for that behavior.

Right. Closing this bug report accordingly.

grep -E '^[[:upper:]]' /etc/passwd

You could use the command above.

This is also very much locale dependent, so if only ASCII uppercase
letters are to be matched, the locale should be set to C or POSIX in
any case.
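
A minimal sketch of that advice, forcing the C locale for a single
invocation. The helper name ascii_upper_start is hypothetical and not part
of grep:

```shell
# Hypothetical helper: print lines that start with an ASCII uppercase letter,
# regardless of the caller's locale settings.
ascii_upper_start() {
    # LC_ALL overrides LANG and every other LC_* variable for this one command.
    LC_ALL=C grep -E '^[[:upper:]]' "$@"
}

# Of the three sample lines, only "Debian-exim" starts with an ASCII capital.
printf 'root\nDebian-exim\nntp\n' | ascii_upper_start
```

With no file arguments the helper reads standard input; file names can be
passed through as usual.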

Sven



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org



Bug#577095: grep: bracket expressions fails depending on the locale

2010-04-10 Thread Jim Meyering
Aníbal Monsalve Salazar wrote:
 I reproduced this bug, see below.

 grep --version
 GNU grep 2.6.3

 cat /tmp/a
 root:x:0:0:root:/root:/bin/bash
 anibal:x:1000:1000:Anibal Monsalve Salazar,,,:/home/anibal:/bin/bash
 Debian-exim:x:102:104::/var/spool/exim4:/bin/false
 ntp:x:106:108::/home/ntp:/bin/false

 grep -E '^[A-Z]' /tmp/a
 root:x:0:0:root:/root:/bin/bash
 Debian-exim:x:102:104::/var/spool/exim4:/bin/false
 ntp:x:106:108::/home/ntp:/bin/false

 grep -Ev '^[A-Z]' /tmp/a
 anibal:x:1000:1000:Anibal Monsalve Salazar,,,:/home/anibal:/bin/bash

Thanks for Cc'ing bug-grep, however this is not a bug in grep-2.6.3.
Rather, it demonstrates that grep-2.5.4-4 failed to honor your locale
settings.

As you noticed, what the [A-Z] range matches depends on your locale settings.
Run the "locale" command to print those settings.
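
Range expressions follow the collation locale, and under the usual POSIX
precedence LC_ALL wins over LC_COLLATE, which in turn wins over LANG. The
following is a rough sketch of that lookup. The function name
effective_collate is hypothetical, and the final "POSIX" fallback is an
assumption, since the default is implementation-defined:

```shell
# Hypothetical helper: report which environment variable decides the
# collation locale, following the LC_ALL > LC_COLLATE > LANG precedence.
effective_collate() {
    if [ -n "${LC_ALL:-}" ]; then
        echo "$LC_ALL"
    elif [ -n "${LC_COLLATE:-}" ]; then
        echo "$LC_COLLATE"
    elif [ -n "${LANG:-}" ]; then
        echo "$LANG"
    else
        # Assumption: with nothing set, most systems behave as POSIX/C.
        echo POSIX
    fi
}

effective_collate
```

Empty variables are treated as unset, which is how the locale variables
are normally interpreted.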

In the C (aka POSIX) locale [A-Z] matches ASCII upper case ABC...Z,
but in many other locales it matches AbBcC...zZ.
Demonstrate with this:

  $ for i in a A b B c C; do \
      printf '%s: ' "$i"; echo "$i" | LC_ALL=en_US.UTF-8 grep -E '[A-Z]' || echo; done
  a:
  A: A
  b: b
  B: B
  c: c
  C: C

If you really want to match only the 26 ASCII upper case letters,
you can run grep in the C locale, even using that risky range notation:

  $ echo b | LC_ALL=C grep '[A-Z]'
  [Exit 1]
  $

However, it's better to avoid the '[A-Z]' range notation and to
prefer the '[[:upper:]]' character class.

Using the [[:CLASS_NAME:]] notation is essential if you also
want to match other (non-ASCII) upper case characters in your locale:

  $ echo É | LC_ALL=fr_FR.UTF-8 grep '[[:upper:]]'
  É

Using range notation is often not what you want:

  $ echo á | LC_ALL=fr_FR.UTF-8 grep '[A-F]'
  á
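
When only a handful of specific characters should match, one way to
sidestep collation order is to list them explicitly instead of writing a
range. This is a small sketch under that assumption. The helper name
has_hex_upper is hypothetical:

```shell
# Hypothetical helper: succeed if the argument contains one of the letters
# A, B, C, D, E or F. The characters are listed one by one, so the match
# should not depend on where the locale sorts accented or lower case letters.
has_hex_upper() {
    printf '%s\n' "$1" | grep -q '[ABCDEF]'
}

if has_hex_upper 'á'; then
    echo "match"
else
    echo "no match"
fi
```

The trade-off is verbosity: an explicit list is practical for a short set
such as A through F, while '[[:upper:]]' remains the better fit for
"any upper case letter".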






Bug#577095: grep: bracket expressions fails depending on the locale

2010-04-09 Thread Bernd Zeimetz
Package: grep
Version: 2.6.3-1
Severity: grave

As this issue might affect a lot of cases where grep is used, I've
decided to file it with the severity grave.

Since version 2.6.3 (and it seems also 2.5.2 was affected), the
behaviour of grep regarding capital letters in bracket expressions
changed when using UTF-8:

With grep 2.6.3:
b...@think ~% LANG=en_US.UTF-8 grep -E '^[A-Z]' /etc/passwd | wc -l
51
b...@think ~% LANG=C grep -E '^[A-Z]' /etc/passwd
Debian-exim:x:100:103::/var/spool/exim4:/bin/false
b...@think ~% LANG=de_DE.iso8859-15 grep -E '^[A-Z]' /etc/passwd 
Debian-exim:x:100:103::/var/spool/exim4:/bin/false

With grep 2.5.4-4:
b...@think ~% LANG=en_US.UTF-8 grep -E '^[A-Z]' /etc/passwd
Debian-exim:x:100:103::/var/spool/exim4:/bin/false
b...@think ~% LANG=C grep -E '^[A-Z]' /etc/passwd  
Debian-exim:x:100:103::/var/spool/exim4:/bin/false
b...@think ~% LANG=de_DE.iso8859-15 grep -E '^[A-Z]' /etc/passwd   
Debian-exim:x:100:103::/var/spool/exim4:/bin/false

This change in behaviour is unexpected and differs from what other
implementations do.
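
The comparison above sets only LANG, which is overridden by LC_ALL or
LC_COLLATE if either is already present in the environment. A sketch of a
more robust per-locale count, using LC_ALL instead, might look like this.
The function name count_upper_start and the sample data are illustrative
only:

```shell
# Hypothetical helper: count lines in a file that start with a character
# in the [A-Z] range, as interpreted in the given locale.
count_upper_start() {
    # $1: locale name, $2: file to search
    LC_ALL="$1" grep -cE '^[A-Z]' "$2"
}

# Illustrative sample data standing in for /etc/passwd.
sample=$(mktemp)
printf 'root:x:0:0\nDebian-exim:x:100:103\n' > "$sample"

# In the C locale only the "Debian-exim" line is counted.
count_upper_start C "$sample"

rm -f "$sample"
```

If the requested locale is not installed, the program typically stays in
the C locale, so it is worth confirming availability with "locale -a"
before comparing counts.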

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.33-think (SMP w/2 CPU cores; PREEMPT)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages grep depends on:
ii  dpkg  1.15.5.6   Debian package management system
ii  install-info  4.13a.dfsg.1-5 Manage installed documentation in 
ii  libc6 2.10.2-6   Embedded GNU C Library: Shared lib

grep recommends no packages.

Versions of packages grep suggests:
ii  libpcre3  7.8-3  Perl 5 Compatible Regular Expressi

-- no debconf information






Bug#577095: grep: bracket expressions fails depending on the locale

2010-04-09 Thread Aníbal Monsalve Salazar
On Fri, Apr 09, 2010 at 04:35:35PM +0200, Bernd Zeimetz wrote:
Package: grep
Version: 2.6.3-1
Severity: grave

As this issue might affect a lot of cases where grep is used, I've
decided to file it with the severity grave.

Since version 2.6.3 (and it seems also 2.5.2 was affected), the
behaviour of grep regarding capital letters in bracket expressions
changed when using UTF-8:

With grep 2.6.3:
b...@think ~% LANG=en_US.UTF-8 grep -E '^[A-Z]' /etc/passwd | wc -l
51
b...@think ~% LANG=C grep -E '^[A-Z]' /etc/passwd
Debian-exim:x:100:103::/var/spool/exim4:/bin/false
b...@think ~% LANG=de_DE.iso8859-15 grep -E '^[A-Z]' /etc/passwd 
Debian-exim:x:100:103::/var/spool/exim4:/bin/false

With grep 2.5.4-4:
b...@think ~% LANG=en_US.UTF-8 grep -E '^[A-Z]' /etc/passwd
Debian-exim:x:100:103::/var/spool/exim4:/bin/false
b...@think ~% LANG=C grep -E '^[A-Z]' /etc/passwd  
Debian-exim:x:100:103::/var/spool/exim4:/bin/false
b...@think ~% LANG=de_DE.iso8859-15 grep -E '^[A-Z]' /etc/passwd   
Debian-exim:x:100:103::/var/spool/exim4:/bin/false

This change in behaviour is unexpected and differs from what other
implementations do.

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.33-think (SMP w/2 CPU cores; PREEMPT)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages grep depends on:
ii  dpkg  1.15.5.6   Debian package management system
ii  install-info  4.13a.dfsg.1-5 Manage installed documentation in 
ii  libc6 2.10.2-6   Embedded GNU C Library: Shared lib

grep recommends no packages.

Versions of packages grep suggests:
ii  libpcre3  7.8-3  Perl 5 Compatible Regular Expressi

-- no debconf information

I reproduced this bug, see below.

grep --version
GNU grep 2.6.3

cat /tmp/a
root:x:0:0:root:/root:/bin/bash
anibal:x:1000:1000:Anibal Monsalve Salazar,,,:/home/anibal:/bin/bash
Debian-exim:x:102:104::/var/spool/exim4:/bin/false
ntp:x:106:108::/home/ntp:/bin/false

grep -E '^[A-Z]' /tmp/a
root:x:0:0:root:/root:/bin/bash
Debian-exim:x:102:104::/var/spool/exim4:/bin/false
ntp:x:106:108::/home/ntp:/bin/false

grep -Ev '^[A-Z]' /tmp/a
anibal:x:1000:1000:Anibal Monsalve Salazar,,,:/home/anibal:/bin/bash






Bug#577095: grep: bracket expressions fails depending on the locale

2010-04-09 Thread Norihiro Tanaka
Hi,

It seems that this is expected behavior. [A-Z] includes A,b,B,c,C,...,y,Y,z,Z
in the en_US locale (but not `a').



