Re: [bug #36567] grep -i (case-insensitive) is broken with UTF8

Johannes Meixner Thu, 14 Jun 2012 04:07:34 -0700


Hello,


On Jun 1 12:02 Jim Meyering wrote (excerpt):

... the way grep's -i is implemented: it converts both the RE and
the buffer-to-search to lower case, and then performs the search.


I wonder whether "convert ... to lower case" is really a correct
implementation of caseless matching, because in
http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf
I found that
"case folding ... is more than just conversion to lowercase":
--------------------------------------------------------------------
Implementation Guidelines
...
5.18 Case Mappings
...
Complications for Case Mapping
...
Context-dependent Case Mappings.
Characters may have different case mappings, depending on
the context surrounding the character in the original string.
For example, U+03A3 [greek capital letter sigma] lowercases
to U+03C3 [greek small letter sigma] if it is followed by
another letter, but lowercases to U+03C2 [greek small letter
final sigma] if it is not.
...
Caseless Matching
Caseless matching is implemented using case folding, which is the
process of mapping characters of different case to a single form,
so that case differences in strings are erased. Case folding allows
for fast caseless matches in lookups because only binary comparison
is required. It is more than just conversion to lowercase. For
example, it correctly handles cases such as the Greek sigma...
--------------------------------------------------------------------

Is grep's -i implemented via plain conversion to lower case,
or is it actually implemented via "case folding"?
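
To illustrate the difference I am asking about, here is a small sketch
in Python (not grep's C code, and not a claim about how grep works).
It only uses the standard str.lower() and str.casefold() methods; the
two helper function names are made up for this example.

```python
# Illustration only: compares "lowercase both sides" with "case-fold both
# sides". The helper names below are hypothetical, not taken from grep.

def match_by_lowercase(a: str, b: str) -> bool:
    """Caseless comparison done by plain conversion to lower case."""
    return a.lower() == b.lower()


def match_by_casefold(a: str, b: str) -> bool:
    """Caseless comparison done by Unicode case folding."""
    return a.casefold() == b.casefold()


FINAL_SIGMA = "\u03c2"   # greek small letter final sigma
SMALL_SIGMA = "\u03c3"   # greek small letter sigma

# Both characters are already lower case, so lowercasing leaves them
# distinct, while case folding maps both to U+03C3.
lowercase_result = match_by_lowercase(FINAL_SIGMA, SMALL_SIGMA)
casefold_result = match_by_casefold(FINAL_SIGMA, SMALL_SIGMA)

# A whole word: upper-case text against its usual lower-case spelling,
# which ends in a final sigma.
word_result = match_by_casefold("\u03a3\u0391\u03a3", "\u03c3\u03b1\u03c2")

if __name__ == "__main__":
    print("lowercase:", lowercase_result)
    print("casefold: ", casefold_result)
    print("word:     ", word_result)
```

With plain lowercasing the two sigma forms stay different, so a
pattern written with one form would miss text written with the other;
with case folding they compare equal.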


FYI:

http://www.unicode.org/versions/Unicode6.1.0/ch05.pdf
describes in particular the "Turkish I" issue in detail...


Kind Regards
Johannes Meixner
--
SUSE LINUX Products GmbH -- Maxfeldstrasse 5 -- 90409 Nuernberg -- Germany
HRB 16746 (AG Nuernberg) GF: Jeff Hawn, Jennifer Guild, Felix Imendoerffer
