[ I am not subscribed; please keep me on the CC. ]
Hi,
After reading the new grep announcement on LWN[1], I wondered how the
German eszett is handled. It seems that it isn't handled at all. This
may end up with the same resolution as the recent LJ/Lj thread[2],
though.
Basically, it seems that grep doesn't handle characters that have more
than one case counterpart when matching case-insensitively. The
uppercase of 'ß' is either 'SS' or 'ẞ' depending on the context[3]. From
some poking, only the latter is supported. My thought[4] was that the
code would generate '[ßSS]', which would be wrong when matching because
a bracket expression matches only a single character, and would instead
need to generate '(ß|SS)'. It now seems that '(ß|SS|ẞ)' or even
'(ß|[sS][sS]|ẞ)' would need to be generated instead using the new code.
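To illustrate the difference between the bracket form and the
alternation, here is a rough sketch. The helper name expand_eszett is
made up for this example, and this is not how grep builds its patterns
internally; it only rewrites a literal pattern by hand and deliberately
avoids -i so that the result does not depend on the locale.

```shell
#!/bin/sh
# Illustration only: expand_eszett is a hypothetical helper, not part of grep.
ss=$(printf '\303\237')        # U+00DF, lowercase eszett (UTF-8 bytes)
SS=$(printf '\341\272\236')    # U+1E9E, capital eszett (UTF-8 bytes)

# Rewrite every lowercase eszett in $1 as an ERE alternation over its
# three case variants: the letter itself, "SS", and the capital eszett.
expand_eszett() {
  printf '%s\n' "$1" | sed "s/$ss/($ss|SS|$SS)/g"
}

pat=$(expand_eszett "stra${ss}e")

# The alternation accepts the two-character form "SS".
printf 'straSSe\n' | grep -E "$pat" >/dev/null \
  && echo "alternation matches straSSe"

# A bracket expression consumes exactly one character, so it cannot
# match "SS" in the same position.
printf 'straSSe\n' | grep -E "stra[${ss}SS]e" >/dev/null \
  || echo "bracket expression does not match straSSe"
```

The '(ß|[sS][sS]|ẞ)' variant would additionally cover 'ss', 'sS' and
'Ss' in the input.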
I've attached a test case I wrote based on 'turkish-eyes'. I release it
to the public domain.
Thanks,
--Ben
[1]https://lwn.net/Articles/586899/
[2]https://lists.gnu.org/archive/html/bug-grep/2014-02/msg00004.html
[3]https://en.wikipedia.org/wiki/Capital_%C3%9F
[4]https://lwn.net/Articles/587010/
#!/bin/sh
# Ensure that case-insensitive matching works with German eszett
. "${srcdir=.}/init.sh"; path_prepend_ ../src
require_en_utf8_locale_
require_compiled_in_MB_support
fail=0
L=de_DE.UTF-8
ss=$(printf '\303\237') # lowercase eszett
SS=$(printf '\341\272\236') # uppercase eszett
# Ensure that this matches:
# printf 'ß:SS ß:ẞ\n'|LC_ALL=de_DE.UTF-8 grep -i 'SS:ß ẞ:ß'
data="$ss:SS $ss:$SS"
search_str="SS:$ss $SS:$ss"
printf '%s\n' "$data" > in || framework_failure_
for opt in -E -F -G; do
  LC_ALL=$L grep $opt -i "$search_str" in > out || fail=1
  compare out in || fail=1
done
Exit $fail