Hello.
While updating GNU grep I found that one test fails. Specifically, the output of
gawk 'BEGIN { printf "\xe2\x80\x80\n" }' is not matched by grep '\s'.
The GNU grep testsuite checks that the following UTF-8 characters are treated as spaces:
utf8_space_characters=$(sed 's/.*://;s/  */\\x/g' <<\EOF
U+0009 Horizontal Tab: 09
U+000B Vertical Tab: 0b
U+000C Form feed: 0c
U+000D Carriage return: 0d
U+0020 SPACE: 20
U+1680 OGHAM SPACE MARK: e1 9a 80
U+2000 EN QUAD: e2 80 80
U+2001 EM QUAD: e2 80 81
U+2002 EN SPACE: e2 80 82
U+2003 EM SPACE: e2 80 83
U+2004 THREE-PER-EM SPACE: e2 80 84
U+2005 FOUR-PER-EM SPACE: e2 80 85
U+2006 SIX-PER-EM SPACE: e2 80 86
U+2008 PUNCTUATION SPACE: e2 80 88
U+2009 THIN SPACE: e2 80 89
U+200A HAIR SPACE: e2 80 8a
U+205F MEDIUM MATHEMATICAL SPACE: e2 81 9f
U+3000 IDEOGRAPHIC SPACE: e3 80 80
EOF
)
The checks fail for the following byte sequences:
  e1 9a 80
  e2 80 80 - e2 80 8a
  e2 81 9f
  e3 80 80
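For reference, a check like this can be reproduced outside grep at the byte level. The following is only a minimal sketch: the helper name mb_is_space is hypothetical (it is not part of grep or libc), and it assumes the caller has already selected a locale with setlocale(). It decodes one multibyte character with mbrtowc() and passes the result to iswspace().

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/*
 * Hypothetical helper: decode one multibyte character from
 * bytes[0..len) in the current locale and classify it.
 * Returns 1 if iswspace() accepts it, 0 if it does not, and -1 if
 * the bytes are not exactly one valid character in this locale.
 */
int mb_is_space(const char *bytes, size_t len)
{
    mbstate_t st;
    wchar_t wc;
    size_t n;

    memset(&st, 0, sizeof st);          /* start in the initial shift state */
    n = mbrtowc(&wc, bytes, len, &st);
    if (n == (size_t)-1 || n == (size_t)-2 || n != len)
        return -1;                      /* invalid, incomplete, or wrong length */
    return iswspace((wint_t)wc) != 0;
}
```

To use it, call setlocale(LC_ALL, "en_US.UTF-8") first, check that it did not return NULL, and then call for example mb_is_space("\xe2\x80\x80", 3) for U+2000.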
I've verified with the following C99 program
#include <wchar.h>
#include <wctype.h>
#include <locale.h>
#include <stdio.h>

// Switch to locale `loc` and print what iswspace() returns for `c`.
void try_with(wchar_t c, const char *loc)
{
    setlocale(LC_ALL, loc);
    printf("in locale %s iswspace returned %d\n", loc, iswspace(c));
}

int main(void)
{
    // wchar_t EM_SPACE = L'\u2003'; // Unicode character 'EM SPACE'
    wchar_t EM_SPACE = L'\u205f';    // U+205F MEDIUM MATHEMATICAL SPACE
    try_with(EM_SPACE, "C");
    try_with(EM_SPACE, "en_US.UTF-8");
    return 0;
}
that iswspace() considers U+2003 (which, as I understand it, corresponds to
e2 80 83) and U+205F (e2 81 9f) to be non-spaces.
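The mapping between code points and the byte sequences in the list can be checked with a small encoder. This is a sketch only, and the function name utf8_encode is hypothetical. It implements the standard UTF-8 bit layout and is independent of any locale settings.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode code point `cp` as UTF-8 into out[],
 * which must have room for 4 bytes. Returns the number of bytes
 * written, or 0 for surrogates and values above U+10FFFF.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                    /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                   /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                   /* surrogates are not encodable */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 11110xxx followed by three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this layout, U+2003 encodes to e2 80 83 and U+205F to e2 81 9f, which agrees with the entries in the testsuite list.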
I ran the same test program on FreeBSD, where both characters are considered
spaces in the en_US.UTF-8 locale.
Is this a bug, or am I missing something?
--
System Administrator of Southern Federal University Computer Center