Tomohiro KUBOTA wrote on 2001-05-14 10:01 UTC:
> previous_character = character_in_focus - 1;
>
> This is not an appropriate code in internationalization context.
> To be encoding-independent, previous_character must be calculated
> from the top of the buffer, though there are encoding-dependent
> way to avoid such machine-power-consuming way for some encodings
> such as UTF-8.
A friend of mine went to a job interview at Microsoft recently.
They give you a number of brain teaser questions. One he got
was the following:
In some CJK encoding, a single character can be either
- a byte in the range 0x20-0x7f
- a byte in the range 0x80-0xff followed by a byte in the range 0x20-0xff
We are writing an editor. You are provided with a buffer representing the
line up to the cursor position. Specify an efficient algorithm that
determines how many bytes have to be removed from the end of that
buffer if the user presses backspace to remove one character.
There are many intuitive but wrong solutions, where people just look at
the last one, two, or three bytes. There is a trivial correct solution
that scans the entire line from the beginning (which is what Microsoft
actually implemented in Windows). Is there a better solution that scans
the line from the end only as far as necessary? What does it look like?
How far do you have to scan?
Think about it. Makes you really appreciate the design of UTF-8
afterwards ...
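
For anyone who wants to check their answer, here is a minimal sketch in
Python of one backward scan that works for the puzzle encoding, next to
its UTF-8 counterpart. The function names are made up for illustration.
Both functions assume a well-formed buffer that ends on a character
boundary, and the UTF-8 one assumes sequences of at most four bytes.

```python
def backspace_len_dbcs(buf: bytes) -> int:
    """Bytes to remove from the end of buf in the puzzle's encoding.

    Hypothetical helper: assumes buf is well formed and ends on a
    character boundary.
    """
    n = len(buf)
    if n == 0:
        return 0
    # Count the run of high bytes (0x80-0xff) directly before the last byte.
    i = n - 2
    while i >= 0 and buf[i] >= 0x80:
        i -= 1
    run = (n - 2) - i
    # A byte below 0x80 is either a single-byte character or a trail byte,
    # so it always ends a character. The run therefore starts on a character
    # boundary and its bytes pair up as lead, trail, lead, trail, ...
    # An odd run leaves one unpaired lead byte that owns the last byte.
    return 2 if run % 2 == 1 else 1


def backspace_len_utf8(buf: bytes) -> int:
    """Bytes to remove from the end of buf if it holds well-formed UTF-8.

    Hypothetical helper: assumes sequences of at most four bytes.
    """
    n = len(buf)
    if n == 0:
        return 0
    i = n - 1
    # Continuation bytes are exactly 0x80-0xbf (bit pattern 10xxxxxx).
    # Step back over them until the byte that starts the character.
    while i > 0 and (buf[i] & 0xC0) == 0x80 and n - i < 4:
        i -= 1
    return n - i


print(backspace_len_dbcs(b"\x81\x82\x83b"))        # (81 82)(83 'b')
print(backspace_len_utf8("a\u20ac".encode("utf-8")))  # 'a' + EURO SIGN
```

In the puzzle encoding the scan can run all the way back to the start of
the line, for example when the line consists only of bytes in the range
0x80-0xff. In UTF-8 it stops after a small fixed number of bytes, because
a continuation byte can never be mistaken for the start of a character.
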
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/