[discuss] One more locale question

Alexander Pyhalov via illumos-discuss Thu, 24 Jul 2014 19:29:32 -0700

Hello.
While updating GNU grep I found that one test fails. Specifically, the output of
 gawk 'BEGIN { printf "\xe2\x80\x80\n" }'
is not matched by grep '\s'.


The GNU grep test suite checks that the following UTF-8 characters are treated as spaces:

utf8_space_characters=$(sed 's/.*://;s/  */\\x/g' <<\EOF
U+0009 Horizontal Tab:            09
U+000B Vertical Tab:              0b
U+000C Form feed:                 0c
U+000D Carriage return:           0d
U+0020 SPACE:                     20
U+1680 OGHAM SPACE MARK:          e1 9a 80
U+2000 EN QUAD:                   e2 80 80
U+2001 EM QUAD:                   e2 80 81
U+2002 EN SPACE:                  e2 80 82
U+2003 EM SPACE:                  e2 80 83
U+2004 THREE-PER-EM SPACE:        e2 80 84
U+2005 FOUR-PER-EM SPACE:         e2 80 85
U+2006 SIX-PER-EM SPACE:          e2 80 86
U+2008 PUNCTUATION SPACE:         e2 80 88
U+2009 THIN SPACE:                e2 80 89
U+200A HAIR SPACE:                e2 80 8a
U+205F MEDIUM MATHEMATICAL SPACE: e2 81 9f
U+3000 IDEOGRAPHIC SPACE:         e3 80 80
EOF
)

Checks for
e1 9a 80
e2 80 80 - e2 80 8a
e2 81 9f, e3 80 80
fail.

I've verified with the following C99 program
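#include <wchar.h>
#include <wctype.h>
#include <locale.h>
#include <stdio.h>
/* Switch to the given locale and print what iswspace() returns for c. */
void try_with(wchar_t c, const char *loc)
{
    setlocale(LC_ALL, loc);
    printf("in locale %s iswspace returned  %d\n", loc, iswspace(c));
}
int main(void)
{
    /* Swap the two definitions below to test U+2003 instead of U+205F. */
//    wchar_t EM_SPACE = L'\u2003'; // Unicode character 'EM SPACE'
    wchar_t EM_SPACE = L'\u205f';   // U+205F MEDIUM MATHEMATICAL SPACE
    try_with(EM_SPACE, "C");
    try_with(EM_SPACE, "en_US.UTF-8");
    return 0;
}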

that iswspace considers \u2003 (which, as I understand it, corresponds to e2 80 83) and \u205f (e2 81 9f) to be non-spaces. I've run the same test program on FreeBSD. It considers both characters spaces in the en_US.UTF-8 locale.

Is this a bug, or am I missing something?

--
System Administrator of Southern Federal University Computer Center


