
Re: c++ strings and UTF-8 (other charsets)

2007-02-25 Thread Rich Felker

On Sat, Feb 24, 2007 at 06:13:37PM +0100, Julien Claassen wrote:
> Hi!
>   What I meant about UTF-8-strings in c++: I mean in c and c++ they're not
> standard like in Java.

UTF-16, used by Java, is also variable-width. It can be either 2 bytes
or 4 bytes per character. Support for the characters that use 4 bytes
is generally very poor due to the misconception that it's
fixed-width.. :(
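
To make the variable-width point concrete, here is a minimal sketch in C of
decoding one code point from UTF-16 code units. The helper name utf16_decode
is made up for illustration and does not come from any library. It only shows
that code points above U+FFFF occupy two 16-bit units (a surrogate pair), so
code that assumes one unit per character will mishandle them.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from n UTF-16 code units starting at s.
 * Returns the number of units consumed (1 or 2), or 0 if the input is
 * empty or contains a lone or truncated surrogate.
 * utf16_decode is a hypothetical helper name. */
static size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
	if (n == 0) return 0;
	if (s[0] < 0xD800 || s[0] > 0xDFFF) {
		/* Not a surrogate: a single unit is the whole character */
		*cp = s[0];
		return 1;
	}
	if (s[0] >= 0xDC00) return 0;   /* low surrogate with no high one */
	if (n < 2 || s[1] < 0xDC00 || s[1] > 0xDFFF) return 0;
	/* Combine the 10 payload bits of each surrogate */
	*cp = 0x10000 + (((uint32_t)s[0] - 0xD800) << 10) + (s[1] - 0xDC00);
	return 2;
}
```

For example, U+1D11E is stored as the two units 0xD834 0xDD1E, while U+00E9
is the single unit 0x00E9.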

> I think UTF-8 is a variable width multibyte charset, so
> there are specific problems in handling them allocating the right space. I
> mean the Glib contains something like UString and QT has its QStrings, which
> I think are also UTF-8 capable.

All strings are UTF-8 capable; the unit of data is simply bytes
instead of characters. If you're looking for a class that treats
strings as a sequence of abstract characters rather than a sequence of
bytes, you could look for a library to do this or write your own.
However, I suspect the most useful way to do this in C++ would be to
extend whatever standard byte-based string class you're using with a
derived class.
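
As a rough illustration of the bytes-versus-characters distinction (in C
rather than as a C++ class): strlen() reports bytes, while a
character-oriented layer has to skip UTF-8 continuation bytes. The name
utf8_count is hypothetical, and the sketch assumes the input is already
valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Count the code points in a NUL-terminated UTF-8 string by counting
 * every byte that is not a continuation byte (bit pattern 10xxxxxx).
 * Assumes valid UTF-8; utf8_count is a hypothetical helper name. */
static size_t utf8_count(const char *s)
{
	size_t n = 0;
	for (; *s; s++)
		if (((unsigned char)*s & 0xC0) != 0x80)
			n++;
	return n;
}
```

For the 6-byte string "na\xc3\xafve", strlen() gives 6 and utf8_count()
gives 5. Note that this counts code points, which is still not the same
thing as user-visible characters once combining marks are involved.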

Maybe there's something like this built in to the C++ STL classes
already that I'm not aware of. As I said I don't know much of (modern)
C++. Can someone who knows the language better provide an answer?

It would also be easier to provide you answers if we knew better what
you're trying to do with the strings, i.e. whether you just need to
store them and spit them back in output, or whether you need to do
higher-level Unicode processing like line breaks, collation,
rendering, etc.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:  http://mail.nl.linux.org/linux-utf8/

Re: A call for fixing aterm/rxvt/etc...

2007-02-25 Thread Rich Felker

On Sat, Feb 24, 2007 at 01:39:25AM -0500, Rich Felker wrote:
> > using luit for this sounds appealing, but in my experience luit (a)
> > crashes frequently and (b) is easily confused by escape sequences and
> > has no user interface for resetting all its iso-2022 state, so in
> > practice it works for only a few apps.
>
> Hmm, maybe a replacement for luit is in order then.. If I omit
> iso-2022 support (which IMO is a big plus) then it should just be ~100
> lines of C.. I'll see if I can whip up a prototype sometime soon.

And here it is. Ugly but simple. Syntax is:
tconv [-i inner_encoding] [-o outer_encoding] [-e command ...]

Both encodings default to nl_langinfo(CODESET). Command defaults to
$SHELL. Bad things(tm) may happen if you set either encoding to
something stateful or ascii-incompatible (e.g. non-EUC legacy CJK
encodings) or a transliterating converter.
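
For readers who have not used iconv, the following is a minimal sketch of a
one-shot conversion between two stateless encodings, which is the kind of
operation performed in each direction between the inner and outer encoding.
The function name convert is made up for illustration and is not part of
tconv; the sketch also ignores incomplete trailing sequences, which a
streaming program has to carry over to the next read.

```c
#include <iconv.h>
#include <stddef.h>

/* Convert inlen bytes of `in` from encoding `from` to encoding `to`,
 * writing at most outlen bytes to `out`.
 * Returns the number of bytes written, or (size_t)-1 on any failure.
 * convert is a hypothetical helper name. */
static size_t convert(const char *to, const char *from,
                      const char *in, size_t inlen,
                      char *out, size_t outlen)
{
	iconv_t cd = iconv_open(to, from);
	char *ip = (char *)in, *op = out;   /* iconv advances these */
	size_t il = inlen, ol = outlen, r;

	if (cd == (iconv_t)-1) return (size_t)-1;
	r = iconv(cd, &ip, &il, &op, &ol);
	iconv_close(cd);
	if (r == (size_t)-1) return (size_t)-1;
	return outlen - ol;
}
```

Converting the 4 bytes "caf\xe9" from ISO-8859-1 to UTF-8 this way should
produce the 5 bytes "caf\xc3\xa9", assuming the system's iconv knows both
encoding names.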

Actual usage to fix rxvt:
rxvt -e ./tconv -o iso-8859-1

Known bugs: termios handling is somewhat wrong and something should be
done to ensure that replacements made by iconv match the column width
of the correct character, to avoid corrupting the terminal. Maybe
deadlock situations when terminal blocks..? Other bugs?

Rich
/* Written in 2007 by Rich Felker; released to the public domain */

#define _XOPEN_SOURCE 500

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <stdarg.h>
#include <signal.h>
#include <locale.h>
#include <langinfo.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/select.h>
#include <termios.h>
#include <iconv.h>

/* No-op signal handler (installed for SIGWINCH below) */
static void dummy(int sig)
{
}

/* Write each string argument to fd; the list ends with a null pointer */
static void print(int fd, ...)
{
	va_list ap;
	const char *s;
	va_start(ap, fd);
	while ((s = va_arg(ap, const char *)))
		write(fd, s, strlen(s));
	va_end(ap);
}

int main(int argc, char **argv)
{
	int i, j;
	const char *o_enc, *i_enc;
	char **cmd = 0;
	int pty;
	fd_set rfds, wfds;
	char buf[512], buf2[1536];
	static struct termios tio, tio_old;
	iconv_t itoo, otoi;
	char *in, *out;
	size_t inb, outb;

#ifdef TIOCSWINSZ
	struct winsize ws = { };

	signal(SIGWINCH, dummy);
	ioctl(0, TIOCGWINSZ, &ws);
#endif

	/* Put the outer terminal into raw mode, saving the old settings */
	tcgetattr(0, &tio);
	tio_old = tio;
	tio.c_cflag &= CBAUD;
	tio.c_cflag |= CS8 | CLOCAL | CREAD;
	tio.c_iflag = 0;
	tio.c_oflag = 0;
	tio.c_lflag = 0;
	tcsetattr(0, TCSANOW, &tio);

	setlocale(LC_CTYPE, "");
	i_enc = o_enc = nl_langinfo(CODESET);

	for (i=1; i<argc && !cmd; i++) {
		if (argv[i][0] != '-') {
			print(2, argv[0], ": unrecognized option: '",
				argv[i], "'\n", (char*)0);
			continue;
		}
		for (j=1; argv[i][j]; j++) switch (argv[i][1]) {
		case 'o':
			if (argv[i][j+1]) o_enc = argv[i]+j+1;
			else if (i+1 < argc) o_enc = argv[++i];
			else print(2, argv[0],
				": outer encoding omitted\n", (char *)0);
			break;
		case 'i':
			if (argv[i][j+1]) i_enc = argv[i]+j+1;
			else if (i+1 < argc) i_enc = argv[++i];
			else print(2, argv[0],
				": inner encoding omitted\n", (char *)0);
			break;
		case 'e':
			if (argv[i][j+1]) argv[i] += j+1;
			else if (i+1 < argc) i++;
			else print(2, argv[0],
				": command omitted, using SHELL\n", (char *)0);
			/* null terminate our exec arglist */
			for (j=0; j<argc-i; j++)
				argv[j] = argv[j+i];
			argv[j] = 0;
			cmd = argv;
		}
	}

	itoo = iconv_open(o_enc, i_enc);
	otoi = iconv_open(i_enc, o_enc);
	/* iconv_open reports failure as (iconv_t)-1, not as a null value */
	if (itoo == (iconv_t)-1 || otoi == (iconv_t)-1) {
		print(2, argv[0], ": failed to open iconv between ",
			o_enc, " and ", i_enc, "\n", (char*)0);
		goto die;
	}

	if ((pty = posix_openpt(O_RDWR|O_NOCTTY)) < 0
	  || grantpt(pty) < 0 || unlockpt(pty) < 0) {
		print(2, argv[0], ": failed to get pty: ",
			strerror(errno), "\n", (char *)0);
		goto die;
	}

	switch(fork()) {
	case -1:
		print(2, argv[0], ": failed to fork child: ",
			strerror(errno), "\n", (char *)0);
		goto die;
	case 0:
		setsid();
		i = open(ptsname(pty), O_RDWR);
		close(pty);
		dup2(i, 0);
		dup2(i, 1);
		dup2(i, 2);
		if (i > 2) close(i);
		if (cmd) execvp(cmd[0],

Re: c++ strings and UTF-8 (other charsets)

2007-02-25 Thread Marcel Ruff


Rich Felker wrote:
> On Sat, Feb 24, 2007 at 06:13:37PM +0100, Julien Claassen wrote:
> > Hi!
> >   What I meant about UTF-8-strings in c++: I mean in c and c++ they're not
> > standard like in Java.
>
> UTF-16, used by Java, is also variable-width. It can be either 2 bytes
> or 4 bytes per character. Support for the characters that use 4 bytes
> is generally very poor due to the misconception that it's
> fixed-width.. :(
>
> > I think UTF-8 is a variable width multibyte charset, so
> > there are specific problems in handling them allocating the right space. I
> > mean the Glib contains something like UString and QT has its QStrings, which
> > I think are also UTF-8 capable.


As far as I know:

Using UTF-8 in C or C++ is very simple:
As UTF-8 never uses a '\0' byte inside a character, you can simply use all
the byte-oriented functions as before (strcmp(), std::string etc.).
Old code doesn't need to be ported.
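
A small sketch of why byte-oriented code keeps working: every byte of a
multibyte UTF-8 sequence has its high bit set, so searching for an ASCII
delimiter can never match in the middle of a character. The function name
first_field_len is made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Length in bytes of the first comma-separated field of a UTF-8 string.
 * strcspn is safe here because an ASCII ',' (0x2C) cannot occur inside
 * a multibyte UTF-8 sequence, whose bytes are all >= 0x80.
 * first_field_len is a hypothetical helper name. */
static size_t first_field_len(const char *s)
{
	return strcspn(s, ",");
}
```

With the input "Z\xc3\xbcrich,CH" the result is 7: one byte for 'Z', two for
the u-umlaut, and four for "rich". The result is a byte count, which is what
you want for copying or slicing the buffer.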

The only place to take care is when interfacing with other libraries
that use wchar_t and such (UTF-16, UTF-32); there
you need to convert using functions like wcstombs(), mbstowcs(),
mbrtowc() and such.

This works well on Linux, Windows and other OSes.

Marcel


> All strings are UTF-8 capable; the unit of data is simply bytes
> instead of characters. If you're looking for a class that treats
> strings as a sequence of abstract characters rather than a sequence of
> bytes, you could look for a library to do this or write your own.
> However I suspect the most useful way to do this on C++ would be to
> extend whatever standard byte-based string class you're using with a
> derived class.
>
> Maybe there's something like this built in to the C++ STL classes
> already that I'm not aware of. As I said I don't know much of (modern)
> C++. Can someone who knows the language better provide an answer?
>
> It would also be easier to provide you answers if we knew better what
> you're trying to do with the strings, i.e. whether you just need to
> store them and spit them back in output, or whether you need to do
> higher-level unicode processing like line breaks, collation,
> rendering, etc.
>
> Rich

