Package: locales
Version: 2.10.2-6
Severity: normal

Hi,

in Hungarian, "zs" (as well as "sz", "cs", "ty", "ny", "dz", "dzs", "gy" and
"ly") is said to be part of the alphabet, and each combination is considered
to be a single letter. However, they are written with two or more characters;
there are no single glyphs for them.

"zs" in particular is causing trouble for grep. With LANG=C the first command
prints nothing, as expected; with LANG=hu_HU.UTF-8 the line is matched:

  % echo zs | LANG=C grep '^[^a-z]*$'
  % echo zs | LANG=hu_HU.UTF-8 grep '^[^a-z]*$'
  zs

It's possible to come up with expressions that lead to similarly unexpected
results for the other multi-char letters as well, but these don't occur
frequently:

  % echo ty | LANG=C grep '^[s-u]*$'
  % echo ty | LANG=hu_HU.UTF-8 grep '^[s-u]*$'
  ty

This is undesirable and dumb, for several reasons:

1. grep has no way of knowing whether such a sequence is a "single letter"
   or two letters, because the combination can occur in compound words
   without becoming a multi-char letter; for example, in "fúvószenekar"
   ("fúvós" + "zenekar"), the "sz" is simply an "s" and a "z" letter next
   to each other. There may even exist words that make (a different) sense
   either way, but I can't think of any right now.

2. "zs" is the last letter of the Hungarian alphabet; therefore, no sane
   character range in a regular expression can include it ("[a-zs]" would
   be ambiguous because there isn't a "zs" glyph).

"zs" and the other multi-char letters play an important role in sorting
("zs" has to be sorted after "za" and so on), but please can we treat them
as two characters in all other contexts?

I can also make a socio-ergonomic point: I think most people who deal with
regular expressions, whether they are Hungarian or not, don't expect
Hungarian multi-character "letters" to be treated as single characters in
regular expressions.

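As a stopgap (not a fix for the locale data), the two commands above suggest that forcing the C locale for a single grep call avoids the Hungarian collation tables. A minimal sketch follows; the wrapper name "cgrep" is hypothetical and is not part of grep or of the locales package.

```shell
# Hypothetical wrapper (the name "cgrep" is an assumption, not an existing
# tool): run grep in the C locale so that bracket ranges such as [a-z] are
# evaluated without the Hungarian collation tables.
cgrep() {
    LC_ALL=C grep "$@"
}

# "zs" consists only of characters in a-z, so the pattern must not match.
if printf 'zs\n' | cgrep -q '^[^a-z]*$'; then
    echo "matched"
else
    echo "no match"
fi
```

The trade-off is that LC_ALL=C also makes grep treat the input as bytes, so accented letters such as "ú" are no longer single characters for the purposes of "." or bracket expressions. Setting only LC_COLLATE=C while keeping LC_CTYPE=hu_HU.UTF-8 may be enough when LC_ALL is unset, but that is an assumption and should be checked on the system in question.
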
Andras

-- System Information:
Debian Release: squeeze/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32.7-vs2.3.0.36.28-hellgate (SMP w/3 CPU cores; PREEMPT)
Locale: LANG=C, LC_CTYPE=hu_HU.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages locales depends on:
ii  debconf [debconf-2.0]   1.5.28    Debian configuration management sy
ii  libc6 [glibc-2.10-1]    2.10.2-2  GNU C Library: Shared libraries

locales recommends no packages.

locales suggests no packages.

-- debconf information:
* locales/default_environment_locale: None
* locales/locales_to_be_generated: en_GB ISO-8859-1, en_GB.ISO-8859-15 ISO-8859-15, en_GB.UTF-8 UTF-8, en_US ISO-8859-1, en_US.ISO-8859-15 ISO-8859-15, en_US.UTF-8 UTF-8, hu_HU ISO-8859-2, hu_HU.UTF-8 UTF-8

-- 
Andras Korn <korn at elan.rulez.org> - <http://chardonnay.math.bme.hu/~korn/>
A stitch in time would have confused Einstein.