Bruno Haible wrote on 2001-05-14 13:48 UTC:
> Markus Kuhn writes:
>
> > In some CJK encoding, a single character can be either
> >
> > - a byte in the range 0x20-0x7f
> > - a byte in the range 0x80-0xff followed by a byte in the range 0x20-0xff
>
> That's BIG5...
>
> > Specify an efficient algorithm that
> > determines, how many bytes have to be removed from the end of that
> > buffer if the user presses backspace to remove one character.
>
> I had the same problem when making the "tail" program multibyte aware
> last week :-) "tail -m N" shall print the last N characters of the
> input file. And of course, it should only look at the tail of the
> file, especially if it's a very large file.

A nice textbook example of the sort of problems, complexities and
inefficiencies that purely encoding-independent programming, based only
on the locale-dependent multi-byte functions of the C library, can bring.

With UTF-8, this problem is simple and efficient to solve: every
continuation byte has the form 10xxxxxx, so you can find the start of
the last character by stepping back a few bytes from the end. With
mblen(), by contrast, you would end up with ridiculous performance, and
with only the strictly portable ISO C 99 mb* functions I can't
immediately see any alternative to parsing the entire 5 GB file from
the very beginning.
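
To make the UTF-8 case concrete, here is a minimal sketch, assuming the
buffer holds well-formed UTF-8. The function names utf8_backspace_len
and utf8_tail_offset are made up for illustration; they are not taken
from any existing tool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of bytes to drop from the end of
 * buf[0..len) in order to remove the last UTF-8 encoded character.
 * Assumes well-formed UTF-8; on malformed input it simply stops at
 * the first non-continuation byte or at the start of the buffer. */
static size_t utf8_backspace_len(const unsigned char *buf, size_t len)
{
    size_t n = 0;

    assert(buf != NULL || len == 0);
    while (n < len) {
        n++;
        /* Continuation bytes look like 10xxxxxx; anything else is
         * an ASCII byte or the lead byte of a multi-byte sequence. */
        if ((buf[len - n] & 0xC0) != 0x80)
            break;
    }
    return n;
}

/* Hypothetical helper: byte offset at which the last nchars
 * characters of buf[0..len) begin (0 if there are fewer). */
static size_t utf8_tail_offset(const unsigned char *buf, size_t len,
                               size_t nchars)
{
    size_t pos = len;

    while (nchars > 0 && pos > 0) {
        pos -= utf8_backspace_len(buf, pos);
        nchars--;
    }
    return pos;
}
```

For a "tail"-like tool, one would presumably read only a block from the
end of the file and apply utf8_tail_offset to it, reading further back
only if the block holds fewer than the requested number of characters.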

I hope that in the end you checked the name of the encoding with
nl_langinfo(CODESET) and provided efficient special-case solutions for
the "ISO-8859-*" and "UTF-8" cases ...
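
Such a check could be sketched roughly as follows. The enum and the
function names are hypothetical, and the exact strings returned by
nl_langinfo(CODESET) differ between systems (some spell them "utf8" or
"ISO8859-1"), so the comparisons here are an assumption, not a complete
list.

```c
#include <assert.h>
#include <langinfo.h>
#include <string.h>

/* Hypothetical classification of the locale's character encoding. */
enum codeset_class { CS_SINGLE_BYTE, CS_UTF8, CS_OTHER };

/* Map a codeset name to one of the cases that have a fast path.
 * Only the spellings "UTF-8" and "ISO-8859-*" are recognised here. */
static enum codeset_class classify_codeset(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "UTF-8") == 0)
        return CS_UTF8;
    if (strncmp(name, "ISO-8859-", 9) == 0)
        return CS_SINGLE_BYTE;      /* one byte per character */
    return CS_OTHER;                /* fall back to the mb* functions */
}

/* Classify the encoding of the current locale. The caller is expected
 * to have called setlocale(LC_CTYPE, "") or setlocale(LC_ALL, "")
 * (from <locale.h>) beforehand. */
static enum codeset_class current_codeset_class(void)
{
    return classify_codeset(nl_langinfo(CODESET));
}
```

Only the CS_OTHER case would then need the slow generic path; the other
two can count characters directly from the end of the file.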

Side note: what would users be more interested in?

  a) print the last n characters?
  b) print the last n base characters plus their associated
     combining characters?
  c) print the last n column positions?

Perhaps the last-n-bytes function is mostly used for binary files anyway,
and text files are usually treated on a per-line basis. Where an old
application counted bytes, it is not always clear whether we should turn
this functionality into

  - bytes (b)
  - characters (r)
  - base characters (c)
  - columns (n)
  - all of these as alternatives

Perhaps suffixes have to be added in the case of head/tail to
distinguish these alternatives: 5k = 5000 bytes, 72n = 72 columns, 5mr =
5000000 characters, etc.
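
Purely as an illustration of this suggested syntax, a parser for such
arguments might look like the sketch below. Nothing here reflects the
options of any existing head/tail implementation: parse_count and the
enum are hypothetical, and the sketch assumes decimal multipliers
(k = 1000, m = 1000000) followed by an optional unit letter, with bytes
as the default unit, as in the examples above.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Hypothetical units, following the letters suggested above. */
enum count_unit { UNIT_BYTES, UNIT_CHARS, UNIT_BASE_CHARS, UNIT_COLUMNS };

/* Parse "<digits>[k|m][b|r|c|n]". Returns 0 on success, -1 on a
 * syntax error. Overflow is not checked in this sketch. */
static int parse_count(const char *arg, unsigned long *count,
                       enum count_unit *unit)
{
    char *end;
    unsigned long n;

    assert(arg != NULL && count != NULL && unit != NULL);
    if (!isdigit((unsigned char)arg[0]))
        return -1;
    n = strtoul(arg, &end, 10);

    /* Optional decimal multiplier. */
    if (*end == 'k') {
        n *= 1000UL;
        end++;
    } else if (*end == 'm') {
        n *= 1000000UL;
        end++;
    }

    /* Optional unit letter; a bare number means bytes. */
    switch (*end) {
    case '\0': *unit = UNIT_BYTES; break;
    case 'b':  *unit = UNIT_BYTES; end++; break;
    case 'r':  *unit = UNIT_CHARS; end++; break;
    case 'c':  *unit = UNIT_BASE_CHARS; end++; break;
    case 'n':  *unit = UNIT_COLUMNS; end++; break;
    default:   return -1;
    }
    if (*end != '\0')
        return -1;              /* trailing garbage */

    *count = n;
    return 0;
}
```

With this sketch, "5k" yields 5000 bytes, "72n" yields 72 columns and
"5mr" yields 5000000 characters.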

Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
-
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/lists/