Re: Additional multi-byte functions

Markus Kuhn Mon, 14 May 2001 02:57:35 -0700
Tomohiro KUBOTA wrote on 2001-05-14 10:01 UTC:
>   previous_character = character_in_focus - 1;
> 
> This is not an appropriate code in internationalization context.
> To be encoding-independent, previous_character must be calculated
> from the top of the buffer, though there are encoding-dependent
> way to avoid such machine-power-consuming way for some encodings
> such as UTF-8.

A friend of mine went to a job interview at Microsoft recently.
They give you a number of brain teaser questions. One he got
was the following:

  In some CJK encoding, a single character can be either

    - a byte in the range 0x20-0x7f
    - a byte in the range 0x80-0xff followed by a byte in the range 0x20-0xff

  We are writing an editor. You are given a buffer holding the line
  up to the cursor position. Specify an efficient algorithm that
  determines how many bytes have to be removed from the end of that
  buffer if the user presses backspace to remove one character.

There are many intuitive but wrong solutions, where people just look at
the last one, two, or three bytes. There is a trivial correct solution
that scans the entire line from the beginning (which is what Microsoft
actually implemented in Windows). Is there a better solution that scans
the line from the end only as far as necessary? What does it look like?
How far do you have to scan?
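
One possible answer, as a rough sketch in C. It rests on the observation
that a byte below 0x80 can never be a lead byte, so a character always
ends right after it. Counting the unbroken run of bytes >= 0x80 just
before the last byte then settles the question: an odd run means the
last byte is a trail byte, an even run means it stands alone. The name
dbcs_backspace_len is made up for this example, and the buffer is
assumed to be well-formed in the encoding described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many bytes to delete from the end of buf
 * (len bytes, assumed well-formed in the puzzle's encoding) when the
 * user presses backspace. Returns 0 for an empty buffer. */
size_t dbcs_backspace_len(const unsigned char *buf, size_t len)
{
    size_t i, run = 0;

    assert(buf != NULL || len == 0);
    if (len == 0)
        return 0;

    /* Count the unbroken run of bytes >= 0x80 directly before the
     * last byte. The scan stops at the first byte < 0x80, which must
     * end a character, or at the start of the buffer. */
    for (i = len - 1; i > 0 && buf[i - 1] >= 0x80; i--)
        run++;

    /* Starting from that boundary, the high bytes pair up as lead
     * byte plus trail byte. With an odd run, the final lead byte
     * takes the last byte of the buffer as its trail byte. */
    return (run % 2 == 1) ? 2 : 1;
}
```

In the worst case (a line made up entirely of bytes >= 0x80) this still
scans back to the beginning of the line, so no fixed look-behind of
one, two, or three bytes can be correct.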

Think about it. Makes you really appreciate the design of UTF-8
afterwards ...
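
For comparison, a sketch of the same operation on a UTF-8 buffer.
Continuation bytes always have the form 10xxxxxx and nothing else does,
so the scan is bounded by a small constant. The name utf8_backspace_len
is made up for this example, and the limit of three continuation bytes
assumes characters of at most four bytes, as in RFC 3629.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many bytes to delete from the end of a
 * UTF-8 buffer when the user presses backspace. Returns 0 for an
 * empty buffer. Assumes at most four bytes per character. */
size_t utf8_backspace_len(const unsigned char *buf, size_t len)
{
    size_t n = 0;

    assert(buf != NULL || len == 0);

    /* Step back over at most three continuation bytes (10xxxxxx). */
    while (n < len && n < 3 && (buf[len - 1 - n] & 0xC0) == 0x80)
        n++;

    /* Include the lead byte (or ASCII byte) in front of them, if the
     * buffer still has one. */
    return (n < len) ? n + 1 : n;
}
```

However long the line is, this never looks at more than the last four
bytes.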

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/