Bug ID: 85824
           Summary: regex constructor crashes under UTF-8 locale on
                    Solaris-sparc when parsing a simple character class
           Product: gcc
           Version: 4.9.2
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot
          Reporter: wanyingloo at gmail dot com
  Target Milestone: ---

$ cat test.cpp
#include <locale.h>
#include <regex>

int main (int argc, char *argv[]) {
    setlocale(LC_ALL, "");

$ echo $LANG

$ g++ -std=c++11 test.cpp

$ ./a.out 
terminate called after throwing an instance of 'std::length_error'
  what():  basic_string::append
Abort (core dumped)

$ uname -a
SunOS t-solaris11sparc-02 5.11 11.3 sun4v sparc sun4v Solaris

$ g++ -v
Using built-in specs.
Target: sparc-sun-solaris2.10
Configured with: ../gcc-4.9.2/configure --prefix=/usr
--with-local-prefix=/usr/local --enable-languages=c,c++ --disable-nls
--disable-lto --enable-clocale=generic
-static-libgcc -static-libstdc++ -laio -lmd'
-static-libgcc -static-libstdc++ -laio -lmd' --disable-werror
--with-gnu-ld --with-gnu-as --disable-multiarch --disable-bootstrap
Thread model: posix
gcc version 4.9.2 (GCC) 

I can't reproduce this on Linux with the same GCC version. After some
investigation, the cause seems to be that the regex compiler doesn't account
for the implementation-defined behavior of strxfrm(). I ran the following test
on the same Solaris SPARC machine.

$ cat more_test.cpp 
#include <locale.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

int main (int argc, char *argv[]) {
    setlocale(LC_ALL, "");
    char a[] = { 0x80, '\0' };
    printf("%lu\n", strxfrm(NULL, a, 0));
    printf("%s\n", strerror(errno));
    return 0;
}

$ g++ -std=c++11 -w more_test.cpp 

$ ./a.out 
Illegal byte sequence

In libstdc++-v3/include/bits/locale_classes.tcc, do_transform() is defined as

    do_transform(const _CharT* __lo, const _CharT* __hi) const
              size_t __res = _M_transform(__c, __p, __len);
              __ret.append(__c, __res);

When _M_transform() calls strxfrm() on 0x80 under the UTF-8 locale on Solaris
SPARC, the call fails and returns -1. That value is stored in __res, which has
type size_t, so it becomes a very large number. __ret.append(__c, __res) then
throws std::length_error, and the program aborts. I think it would be nice if
the code checked errno and issued a better error message than the one above.
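
To illustrate the kind of check I mean, here is a minimal sketch. It is not
the libstdc++ code: checked_strxfrm() is a hypothetical helper name, and the
choice to throw std::runtime_error is my own assumption. It clears errno before
calling strxfrm(), which is the approach POSIX suggests since no return value
is reserved for errors, and it also rejects (size_t)-1 explicitly.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper (not part of libstdc++): wraps strxfrm() and reports
// a failure instead of letting (size_t)-1 reach a later append().
std::string checked_strxfrm(const char* src)
{
    // First pass: ask how many bytes the transformed string needs.
    errno = 0;
    std::size_t need = std::strxfrm(nullptr, src, 0);
    int err = errno;
    if (err != 0 || need == static_cast<std::size_t>(-1)) {
        // Assumption: EILSEQ is used as a fallback when errno was not set.
        throw std::runtime_error(std::string("strxfrm: ")
                                 + std::strerror(err ? err : EILSEQ));
    }

    // Second pass: transform into a buffer with room for the terminator.
    std::vector<char> buf(need + 1);
    errno = 0;
    std::size_t got = std::strxfrm(buf.data(), src, buf.size());
    err = errno;
    if (err != 0 || got >= buf.size()) {
        throw std::runtime_error(std::string("strxfrm: ")
                                 + std::strerror(err ? err : EILSEQ));
    }
    return std::string(buf.data(), got);
}
```

With a wrapper like this, the failing input would surface as an exception
carrying the strerror() text rather than as a length_error from
basic_string::append.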

When I ran this test on a Linux machine with GCC 4.9.2, glibc's strxfrm()
converted 0x80 to 6 bytes. I tend to think the Solaris SPARC libc behavior
makes more sense here, since 0x80 on its own isn't a valid UTF-8 sequence even
though it's a valid UTF-8 code unit (a continuation byte). I have no idea why
glibc converts it to 6 bytes. In any event, how strxfrm() converts 0x80 under
UTF-8 is implementation-defined, and I'm not sure do_transform() accounts for
that. At the very least, it could be more defensive by checking errno, I think.
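
For reference, the claim that a lone 0x80 is not well-formed UTF-8 can be
checked with a small validator. This is only a sketch: is_valid_utf8() is a
hypothetical name, and nothing here is taken from libstdc++ or from either
libc. It decodes each sequence and rejects stray continuation bytes, truncated
sequences, overlong encodings, surrogates, and values above U+10FFFF.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` is well-formed UTF-8.
bool is_valid_utf8(const std::string& s)
{
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char b = static_cast<unsigned char>(s[i]);
        std::size_t extra;     // continuation bytes expected after the lead byte
        unsigned long cp;      // code point decoded so far
        unsigned long min_cp;  // smallest code point allowed for this length

        if (b < 0x80)                { extra = 0; cp = b;        min_cp = 0; }
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min_cp = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min_cp = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min_cp = 0x10000; }
        else return false;  // 0x80..0xBF as a lead byte, or 0xF8..0xFF

        if (s.size() - i - 1 < extra) return false;  // truncated sequence

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;    // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate

        i += extra + 1;
    }
    return true;
}
```

Here the single byte 0x80 is rejected because it is a continuation byte with
no lead byte, while the two-byte sequence 0xC2 0x80 (U+0080) is accepted.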
