Johannes Meixner wrote:
> Hello,
>
> On Jun 1 12:02 Jim Meyering wrote (excerpt):
>>
>> i='\xC4\xB0'
>> printf "$i$i$i$i$i$i$i\n" > in
>> LC_ALL=en_US.UTF-8 grep -i .... in > out
>> cmp in out > /dev/null || echo FAIL
>>
>> As I mentioned in the link above, this is a problem because of the way
>> grep's -i is implemented: it converts both the RE and the buffer-to-search
>> to lower case, and then performs the search.  The problem arises with
>> Turkish-I because the conversion changes the length of the buffer (in
>> the example test, the input is 15 bytes long -- 7 x 2-byte I-with-dot
>> + newline, yet the lower case version has a length of just 8: 7 x
>> lower-cased i + NL), and the code returns the match offset and length
>> relative to the shortened lower-case buffer (that lower-cased buffer is
>> internal to code duplicated in EGexecute/Fexecute), yet it uses those
>> offset,length numbers to manipulate the original buffer.
>>
>> Without re-architecting too much, one solution is to change mbtolower to
>> return additional information: a malloc'd mapping vector M, of the same
>> length as its returned buffer, where M[i] is the length-in-bytes of the
>> character that formed byte i of the result.  With that, or something
>> similar, the caller could then map the currently-erroneous offset,len
>> numbers to equivalent numbers that apply to the original buffer.  This
>> mapping could be allocated/defined only when lengths actually differ,
>> so that the cost in general would be negligible.
>
> I am not at all a localization expert and perhaps I misunderstand
> something but perhaps it is not safe to only test if lengths differ.
>
> I fear there exists a special locale setting where a special
> multibyte character string exists where its lower-cased counterpart
> has the same length but nevertheless the character positions in both
> strings do not match.
>
> I am thinking about something like a two-character string
> "[3-byte-upper-case-character-1][2-byte-upper-case-character-2]"
> where its lower-cased counterpart is
> "[2-byte-lower-case-character-1][3-byte-lower-case-character-2]"
>
> Something like "[AAA][BB]" versus "[aa][bbb]" where
> [AAA] is a 3-byte upper-case character where
> [aa] is its 2-byte lower-case counterpart and
> [BB] is a 2-byte upper-case character where
> [bbb] is its 3-byte lower-case counterpart.

Nice catch.  Thank you for reporting that.

> Do such or similar kind of strings actually exist?

I'll bet it's possible.  If someone comes up with an example,
please let us know.  All it takes is a lower case character
(in a UTF-8 locale) whose encoding is longer, in bytes, than that
of its upper case companion.  Then put that upper case character
on a line with the Turkish I-with-dot, and run grep -i to select
that line.

> If yes could such kind of strings still cause errors?

Yes.
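
To make the offset problem and the proposed mapping vector concrete, here is a rough sketch.  It is Python rather than grep's C, and it is only an illustration of the idea, not the actual mbtolower change.  The names simple_lower, lower_with_map and to_original are made up for this example.  The U+0130 -> 'i' mapping is hard-coded on the assumption that it mirrors the lower-casing described above (Python's own str.lower() behaves differently for that character).  U+023A is used as a candidate for the character asked about: as far as I can tell it takes 2 bytes in UTF-8 while its lower case form U+2C65 takes 3.

```python
def simple_lower(ch: str) -> str:
    """One-to-one lower-casing of a single character (illustrative only)."""
    if ch == "\u0130":          # assumption: I-with-dot lower-cases to plain 'i'
        return "i"
    low = ch.lower()
    return low if len(low) == 1 else ch   # ignore one-to-many mappings


def lower_with_map(buf: bytes):
    """Lower-case a UTF-8 buffer and build the mapping vector M.

    m[i] is the length in bytes of the original character that
    produced byte i of the lower-cased buffer.
    """
    out = bytearray()
    m = []
    for ch in buf.decode("utf-8"):
        low = simple_lower(ch).encode("utf-8")
        out += low
        m += [len(ch.encode("utf-8"))] * len(low)
    return bytes(out), m


def is_lead(b: int) -> bool:
    """True for the first byte of a UTF-8 character."""
    return (b & 0xC0) != 0x80


def to_original(low: bytes, m, off: int, length: int):
    """Map a character-aligned (off, length) in the lower-cased buffer
    back to the original buffer, counting each character once."""
    orig_off = sum(m[i] for i in range(off) if is_lead(low[i]))
    orig_len = sum(m[i] for i in range(off, off + length) if is_lead(low[i]))
    return orig_off, orig_len


# I-with-dot shrinks by one byte, U+023A grows by one byte:
# both buffers are 5 bytes long, yet the character boundaries differ.
orig = "\u0130\u023ax".encode("utf-8")
low, m = lower_with_map(orig)
print(len(orig), len(low), m)

# A match on the second character, in lower-cased coordinates ...
off, length = 1, 3
# ... translated to coordinates valid for the original buffer.
print(to_original(low, m, off, length))
```

In this sketch the two buffers have equal length, so a "build the map only when lengths differ" shortcut would skip the mapping even though offset 1 in the lower-cased buffer falls in the middle of the I-with-dot in the original.  A cheaper test that would still be safe is whether any single character changed its byte length during conversion.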
