Markus> We write an editor. You are provided with a buffer representing
    Markus> the line up to the cursor position. Specify an efficient algorithm
    Markus> that determines, how many bytes have to be removed from the end of
    Markus> that buffer if the user presses backspace to remove one character.

    Markus> There are many intuitive but wrong solutions, where people just
    Markus> look at the last one, two, or three bytes. There is a trivial
    Markus> correct solution that scans the entire line from the beginning
    Markus> (which is what Microsoft actually implemented in Windows). Is
    Markus> there a better solution that scans the line from the end only as
    Markus> far as necessary? How does it look like?  How far do you have to
    Markus> scan?

A few years ago, I used a variation of this problem as a class assignment in a
natural language processing course.  The problem was to search for a string in
some text and determine if the match started on a proper "character" boundary.
This was for cases of searching text containing one of the eight-bit forms of
GB2312.1980, JISX0208.1983, KSC5601.1987, or Big5 and possibly ASCII.

The solution turns out to be elegantly simple and reasonably efficient, with
one performance enhancement needed to improve handling of cases where the text
is all 8-bit GB2312, JISX0208, or KSC5601.

Part of the answer is simply to look backward until you encounter a 7-bit
character or the beginning of the buffer after a match occurs.  The rest of
the answer I will leave as an exercise for the reader :-)

BTW, Wong Kam-Fai of the Chinese University in Hong Kong did a related paper
in his "String Matching on Chinese/English Mixed Texts."  Though I did send
him my simplifications of his idea in the form of code, I don't know if they
ever made it into any of his later publications.

And as Markus indicates, UTF-8 makes the solution even simpler.
-----------------------------------------------------------------------------
Mark Leisher                      Times are bad.  Children no longer obey
Computing Research Lab            their parents, and everyone is writing
New Mexico State University       a book.
Box 30001, Dept. 3CRL                -- Marcus Tullius Cicero
Las Cruces, NM  88003
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/

Reply via email to