Hi,
At Mon, 14 May 2001 11:14:25 +0100,
Markus Kuhn <[EMAIL PROTECTED]> wrote:
> There are many intuitive but wrong solutions, where people just look at
> the last one, two, or three bytes. There is a trivial correct solution
> that scans the entire line from the beginning (which is what Microsoft
> actually implemented in Windows). Is there a better solution that scans
> the line from the end only as far as necessary? How does it look like?
> How far do you have to scan?
>
> Think about it. Makes you really appreciate the design of UTF-8
> afterwards ...
Yes. UTF-8 is well designed to deal with this problem, at the
cost of more memory consumption. I think that almost all Japanese
programmers learn Japanese text handling at an early stage and
know this problem very well. Indeed, it is a daily problem for
Japanese programmers.
However, even though we know there is a good solution for UTF-8,
we must not implement that solution in the generic mb/wc context,
because code written for the mb/wc interface cannot assume the
encoding is self-synchronizing the way UTF-8 is.
There is a tradeoff between speed and memory consumption. If speed
is important, keep a second buffer that holds the byte length of
each character. When you store a new character, also store its
byte length in that buffer; when you remove the last character,
remove the number of bytes given by the last entry of that buffer.
If the supported encodings are limited, each element of that
buffer may be a single bit (0 meaning a 1-byte character and
1 meaning a 2-byte character). On the other hand, if low memory
consumption is important, you will have to scan from the top of
the buffer.
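
To make the fast variant concrete, here is a minimal sketch in C;
the structure and function names (struct editbuf, eb_append,
eb_backspace) and the fixed buffer sizes are only illustrative,
not part of any existing API:

/* Minimal sketch: an editor-style buffer where characters are only
 * appended to and removed from the end.  A parallel array records
 * the byte length of each stored character. */
#include <stddef.h>
#include <string.h>

struct editbuf {
    char          bytes[1024];  /* the text itself, as multibyte bytes  */
    size_t        nbytes;       /* bytes used in bytes[]                */
    unsigned char clen[1024];   /* byte length of each stored character */
    size_t        nchars;       /* number of characters stored          */
};

/* Append one character, given its bytes and their count
 * (e.g. the result of a previous mbrlen()/mbrtowc() call). */
static int eb_append(struct editbuf *eb, const char *mb, size_t len)
{
    if (eb->nbytes + len > sizeof eb->bytes || eb->nchars >= sizeof eb->clen)
        return -1;              /* buffer full */
    memcpy(eb->bytes + eb->nbytes, mb, len);
    eb->nbytes += len;
    eb->clen[eb->nchars++] = (unsigned char)len;
    return 0;
}

/* Remove the last character in O(1): no rescanning from the start,
 * and no assumption that the encoding is self-synchronizing. */
static int eb_backspace(struct editbuf *eb)
{
    if (eb->nchars == 0)
        return -1;              /* nothing to delete */
    eb->nbytes -= eb->clen[--eb->nchars];
    return 0;
}

The memory-saving alternative is to keep no length buffer at all
and to rescan the text from the start with mbrlen() whenever the
position of the last character is needed.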
Another example of an internationalization problem: the second
byte of a multibyte character may have the same value as '/', '\',
'$', or another character that is treated as an escape or
delimiter character. Such a byte must not be treated as an escape
character. (In Shift_JIS, for example, the second byte of many
characters is 0x5C, the same value as '\'.)
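
A sketch of how to avoid this, assuming the string is in the
locale's multibyte encoding (setlocale(LC_CTYPE, "") has been
called) and stepping over whole characters with the standard
mbrlen(); find_delim() itself is only a hypothetical helper:

#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Return a pointer to the first *character* equal to delim, or NULL.
 * A naive strchr() would also match bytes inside a multibyte
 * character (e.g. a 0x5C trail byte in Shift_JIS). */
static const char *find_delim(const char *s, char delim)
{
    mbstate_t st;
    memset(&st, 0, sizeof st);
    while (*s != '\0') {
        size_t len = mbrlen(s, MB_CUR_MAX, &st);
        if (len == (size_t)-1 || len == (size_t)-2)
            return NULL;        /* invalid or incomplete sequence */
        if (len == 1 && *s == delim)
            return s;           /* a real, single-byte delimiter */
        s += len;               /* skip the whole character */
    }
    return NULL;
}

With a byte-wise strchr() instead, a 0x5C trail byte inside a
Shift_JIS character would match '\' even though no backslash is
present in the text.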
---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://www.debian.or.jp/~kubota/
"Introduction to I18N" http://www.debian.org/doc/manuals/intro-i18n/
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/