Bug#577095: grep: bracket expressions fails depending on the locale
On Sat, Apr 10, 2010 at 08:46:06AM +0200, Sven Joachim wrote:
On 2010-04-10 08:06 +0200, Aníbal Monsalve Salazar wrote:

Version: 2.6.3-1

On Sat, Apr 10, 2010 at 01:54:51PM +0900, Norihiro Tanaka wrote:

Hi,

It seems that this is expected behavior.

[A-Z] includes A,b,B,c,C,...y,Y,z,Z in en_US locale (not including `a').

According to the NEWS file, this should have been the case since grep 2.5. I wonder if some of the now removed Debian patches prohibited that behavior; at least there is no indication that it had been turned off deliberately in Debian's versions.

I'll check the removed patches in 2.5.4-4 to find out if one of them was responsible for that behavior.

Right. Closing this bug report accordingly.

    grep -E '^[[:upper:]]' /etc/passwd

You could use the command above.

This is also very much locale dependent, so if only ASCII uppercase letters are to be matched, the locale should be set to C or POSIX in any case.

Sven

--
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of unsubscribe. Trouble? Contact listmas...@lists.debian.org
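A minimal sketch of the last point in this message, pinning the locale to C so that [[:upper:]] means only the ASCII letters A to Z. The sample lines are made up for illustration and stand in for /etc/passwd; the variable names are hypothetical.

```shell
# Hypothetical sample data standing in for /etc/passwd.
sample=$(printf '%s\n' \
  'root:x:0:0:root:/root:/bin/bash' \
  'Debian-exim:x:102:104::/var/spool/exim4:/bin/false' \
  'ntp:x:106:108::/home/ntp:/bin/false')

# LC_ALL=C applies to this one grep invocation only, so the rest of
# the script keeps whatever locale the user has configured.
ascii_upper=$(printf '%s\n' "$sample" | LC_ALL=C grep -E '^[[:upper:]]')
printf '%s\n' "$ascii_upper"
```

Setting LC_ALL on the command itself, rather than exporting it globally, keeps messages and sorting elsewhere in the script in the user's own locale.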
Bug#577095: grep: bracket expressions fails depending on the locale
Aníbal Monsalve Salazar wrote:

I reproduced this bug, see below.

    $ grep --version
    GNU grep 2.6.3

    $ cat /tmp/a
    root:x:0:0:root:/root:/bin/bash
    anibal:x:1000:1000:Anibal Monsalve Salazar,,,:/home/anibal:/bin/bash
    Debian-exim:x:102:104::/var/spool/exim4:/bin/false
    ntp:x:106:108::/home/ntp:/bin/false

    $ grep -E '^[A-Z]' /tmp/a
    root:x:0:0:root:/root:/bin/bash
    Debian-exim:x:102:104::/var/spool/exim4:/bin/false
    ntp:x:106:108::/home/ntp:/bin/false

    $ grep -Ev '^[A-Z]' /tmp/a
    anibal:x:1000:1000:Anibal Monsalve Salazar,,,:/home/anibal:/bin/bash

Thanks for Cc'ing bug-grep. However, this is not a bug in grep-2.6.3. Rather, it demonstrates that grep-2.5.4-4 failed to honor your locale settings.

As you noticed, what the [A-Z] range matches depends on your locale settings. Run locale to print those settings. In the C (aka POSIX) locale [A-Z] matches ASCII upper case ABC...Z, but in many other locales it matches AbBcC...zZ.

Demonstrate with this:

    $ for i in a A b B c C; do \
        printf '%s: ' "$i"; echo "$i" | LC_ALL=en_US.UTF-8 grep -E '[A-Z]' || echo; done
    a:
    A: A
    b: b
    B: B
    c: c
    C: C

If you really want to match only the 26 ASCII upper case letters, you can run grep in the C locale, even using that risky range notation:

    $ echo b | LC_ALL=C grep '[A-Z]'
    [Exit 1]
    $

However, it's better to avoid the '[A-Z]' range notation and to prefer the '[[:upper:]]' character class. Using the [[:CLASS_NAME:]] notation is essential if you also want to match other (non-ASCII) upper case characters in your locale:

    $ echo É | LC_ALL=fr_FR.UTF-8 grep '[[:upper:]]'
    É

Using range notation is often not what you want:

    $ echo á | LC_ALL=fr_FR.UTF-8 grep '[A-F]'
    á
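The loop in this message can be wrapped in a small helper to probe what a bracket expression matches under a given locale. This is only a sketch: matched_letters is a hypothetical name, the probe list is arbitrary, and only the C-locale result is predictable across systems.

```shell
# matched_letters PATTERN [LOCALE]
# Print, on one line, those probe letters that PATTERN matches when
# grep runs under LOCALE (defaults to C). Hypothetical helper.
matched_letters() {
  ml_pattern=$1
  ml_locale=${2:-C}
  for ml_ch in a A b B c C y Y z Z; do
    if printf '%s\n' "$ml_ch" | LC_ALL=$ml_locale grep -q -E "$ml_pattern"; then
      printf '%s' "$ml_ch"
    fi
  done
  echo
}

matched_letters '[A-Z]' C   # prints ABCYZ
```

Calling matched_letters '[A-Z]' en_US.UTF-8 shows what the range matches on the local system; that result depends on the grep and libc versions and on whether the locale is installed, so no output is given for it here.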
Bug#577095: grep: bracket expressions fails depending on the locale
Package: grep
Version: 2.6.3-1
Severity: grave

As this issue might affect a lot of cases where grep is used, I've decided to file it with the severity grave.

Since version 2.6.3 (and it seems also 2.5.2 was affected), the behaviour of grep regarding capital letters in bracket expressions changed when using UTF8:

With grep 2.6.3:

    b...@think ~% LANG=en_US.UTF-8 grep -E '^[A-Z]' /etc/passwd | wc -l
    51
    b...@think ~% LANG=C grep -E '^[A-Z]' /etc/passwd
    Debian-exim:x:100:103::/var/spool/exim4:/bin/false
    b...@think ~% LANG=de_DE.iso8859-15 grep -E '^[A-Z]' /etc/passwd
    Debian-exim:x:100:103::/var/spool/exim4:/bin/false

With grep 2.5.4-4:

    b...@think ~% LANG=en_US.UTF-8 grep -E '^[A-Z]' /etc/passwd
    Debian-exim:x:100:103::/var/spool/exim4:/bin/false
    b...@think ~% LANG=C grep -E '^[A-Z]' /etc/passwd
    Debian-exim:x:100:103::/var/spool/exim4:/bin/false
    b...@think ~% LANG=de_DE.iso8859-15 grep -E '^[A-Z]' /etc/passwd
    Debian-exim:x:100:103::/var/spool/exim4:/bin/false

This behaviour change is not expected and different from what other implementations do.

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.33-think (SMP w/2 CPU cores; PREEMPT)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages grep depends on:
ii  dpkg          1.15.5.6        Debian package management system
ii  install-info  4.13a.dfsg.1-5  Manage installed documentation in
ii  libc6         2.10.2-6        Embedded GNU C Library: Shared lib

grep recommends no packages.

Versions of packages grep suggests:
ii  libpcre3      7.8-3           Perl 5 Compatible Regular Expressi

-- no debconf information
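For scripts that relied on the old behaviour, one workaround that does not depend on the locale at all is to spell out the letters instead of using a range. A rough sketch, with made-up sample lines modelled on the ones in this report and hypothetical variable names:

```shell
# Hypothetical sample lines modelled on the passwd entries in the report.
lines='root:x:0:0:root:/root:/bin/bash
anibal:x:1000:1000:Anibal Monsalve Salazar,,,:/home/anibal:/bin/bash
Debian-exim:x:100:103::/var/spool/exim4:/bin/false'

# An explicit list contains no range, so collation order never comes
# into play: each listed character matches only itself.
upper='ABCDEFGHIJKLMNOPQRSTUVWXYZ'
count_listed=$(printf '%s\n' "$lines" | grep -c -E "^[$upper]")
echo "$count_listed"   # prints 1 (only the Debian-exim line)
```

This is more verbose than [A-Z], but the result is the same whichever LANG or LC_ALL value the caller happens to have set.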
Bug#577095: grep: bracket expressions fails depending on the locale
On Fri, Apr 09, 2010 at 04:35:35PM +0200, Bernd Zeimetz wrote:

Package: grep
Version: 2.6.3-1
Severity: grave

As this issue might affect a lot of cases where grep is used, I've decided to file it with the severity grave.

Since version 2.6.3 (and it seems also 2.5.2 was affected), the behaviour of grep regarding capital letters in bracket expressions changed when using UTF8:

With grep 2.6.3:

    b...@think ~% LANG=en_US.UTF-8 grep -E '^[A-Z]' /etc/passwd | wc -l
    51
    b...@think ~% LANG=C grep -E '^[A-Z]' /etc/passwd
    Debian-exim:x:100:103::/var/spool/exim4:/bin/false
    b...@think ~% LANG=de_DE.iso8859-15 grep -E '^[A-Z]' /etc/passwd
    Debian-exim:x:100:103::/var/spool/exim4:/bin/false

With grep 2.5.4-4:

    b...@think ~% LANG=en_US.UTF-8 grep -E '^[A-Z]' /etc/passwd
    Debian-exim:x:100:103::/var/spool/exim4:/bin/false
    b...@think ~% LANG=C grep -E '^[A-Z]' /etc/passwd
    Debian-exim:x:100:103::/var/spool/exim4:/bin/false
    b...@think ~% LANG=de_DE.iso8859-15 grep -E '^[A-Z]' /etc/passwd
    Debian-exim:x:100:103::/var/spool/exim4:/bin/false

This behaviour change is not expected and different from what other implementations do.

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.33-think (SMP w/2 CPU cores; PREEMPT)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages grep depends on:
ii  dpkg          1.15.5.6        Debian package management system
ii  install-info  4.13a.dfsg.1-5  Manage installed documentation in
ii  libc6         2.10.2-6        Embedded GNU C Library: Shared lib

grep recommends no packages.

Versions of packages grep suggests:
ii  libpcre3      7.8-3           Perl 5 Compatible Regular Expressi

-- no debconf information

I reproduced this bug, see below.

    $ grep --version
    GNU grep 2.6.3

    $ cat /tmp/a
    root:x:0:0:root:/root:/bin/bash
    anibal:x:1000:1000:Anibal Monsalve Salazar,,,:/home/anibal:/bin/bash
    Debian-exim:x:102:104::/var/spool/exim4:/bin/false
    ntp:x:106:108::/home/ntp:/bin/false

    $ grep -E '^[A-Z]' /tmp/a
    root:x:0:0:root:/root:/bin/bash
    Debian-exim:x:102:104::/var/spool/exim4:/bin/false
    ntp:x:106:108::/home/ntp:/bin/false

    $ grep -Ev '^[A-Z]' /tmp/a
    anibal:x:1000:1000:Anibal Monsalve Salazar,,,:/home/anibal:/bin/bash
Bug#577095: grep: bracket expressions fails depending on the locale
Hi,

It seems that this is expected behavior.

[A-Z] includes A,b,B,c,C,...y,Y,z,Z in en_US locale (not including `a').
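The letter order described here is the locale's collation order, which sort also follows, so sort gives a quick way to see it. A small sketch showing only the C locale, where the result is fixed; c_order is a hypothetical variable name.

```shell
# In the C locale, sort compares by byte value, so every ASCII
# uppercase letter comes before every lowercase one.
c_order=$(printf '%s\n' b A a B | LC_ALL=C sort | tr -d '\n')
echo "$c_order"   # prints ABab
```

Replacing LC_ALL=C with LC_ALL=en_US.UTF-8 may show the cases interleaved instead, which would be consistent with [A-Z] picking up lowercase letters as described above. That output depends on the installed locale data and libc version, so it is not shown here.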