Re: Additional multi-byte functions

Markus Kuhn Mon, 14 May 2001 07:02:06 -0700
Bruno Haible wrote on 2001-05-14 13:48 UTC:
> Markus Kuhn writes:
> 
> >   In some CJK encoding, a single character can be either
> > 
> >     - a byte in the range 0x20-0x7f
> >     - a byte in the range 0x80-0xff followed by a byte in the range 0x20-0xff
> 
> That's BIG5...
> 
> >   Specify an efficient algorithm that
> >   determines, how many bytes have to be removed from the end of that
> >   buffer if the user presses backspace to remove one character.
> 
> I had the same problem when making the "tail" program multibyte aware
> last week :-) "tail -m N" shall print the last N characters of the
> input file. And of course, it should only look at the tail of the
> file, especially if it's a very large file.

A nice textbook example of the problems, complexity and inefficiency
that purely encoding-independent programming, based only on the
locale-dependent multi-byte functions of the C library, can bring.
With UTF-8, this problem is simple to solve efficiently: scan backwards
from the end until you reach a byte that is not a continuation byte
(10xxxxxx). With the strictly portable ISO C 99 mb* functions such as
mblen, on the other hand, I can't immediately see any alternative to
parsing the entire 5 GB file from the very beginning, which gives
ridiculous performance.
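
To make the contrast concrete, here is a minimal sketch of both
approaches to the backspace question. The function names are made up
for illustration. The UTF-8 version assumes the buffer holds well-formed
UTF-8 and never looks at more than the last four bytes; the portable
version only uses mbrlen and therefore has to walk the whole buffer.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* UTF-8 only: number of bytes to drop from the end of buf[0..len) to
 * delete the last character. Assumes well-formed UTF-8; on malformed
 * input it simply steps back at most 4 bytes. */
size_t utf8_backspace_len(const unsigned char *buf, size_t len)
{
    size_t n = 0;

    /* Skip up to three trailing continuation bytes (10xxxxxx). */
    while (n < len && n < 3 && (buf[len - 1 - n] & 0xC0) == 0x80)
        n++;
    /* Include the lead byte (or the single ASCII byte), if any. */
    if (n < len)
        n++;
    return n;
}

/* Encoding-independent: walk forward from the start with mbrlen and
 * remember the length of the last character seen. Cost is O(len). */
size_t mb_backspace_len(const char *buf, size_t len)
{
    mbstate_t st;
    size_t pos = 0, last = 0;

    memset(&st, 0, sizeof st);
    while (pos < len) {
        size_t k = mbrlen(buf + pos, len - pos, &st);
        if (k == (size_t)-1 || k == (size_t)-2) {
            /* Invalid or incomplete sequence: treat as one byte. */
            k = 1;
            memset(&st, 0, sizeof st);
        } else if (k == 0) {
            k = 1; /* an embedded NUL still occupies one byte */
        }
        last = k;
        pos += k;
    }
    return last;
}
```

For a 5 GB file the first function touches a handful of bytes, the
second one all of them. mb_backspace_len interprets the bytes according
to the current LC_CTYPE locale, so a real program would call
setlocale(LC_CTYPE, "") first.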

I hope that in the end you checked the name of the encoding returned by
nl_langinfo(CODESET) and provided efficient special-case solutions for
the "ISO-8859-*" and "UTF-8" cases ...
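
Such a check could look roughly like the sketch below. The enum and
function names are hypothetical, and the string comparison assumes the
codeset names are spelled exactly "UTF-8" and "ISO-8859-N"; some systems
spell them differently (for example "utf8" or "ISO8859-1"), so a real
program would need to normalize the name first.

```c
#include <langinfo.h>
#include <string.h>

/* Hypothetical strategies a multibyte-aware tail could choose from. */
enum tail_strategy {
    TAIL_SINGLE_BYTE, /* one byte per character: plain byte arithmetic */
    TAIL_UTF8,        /* scan backwards over continuation bytes */
    TAIL_GENERIC_MB   /* fall back to the mb* functions */
};

/* Map a codeset name to a strategy. Assumes canonical spellings. */
enum tail_strategy classify_codeset(const char *codeset)
{
    if (strcmp(codeset, "UTF-8") == 0)
        return TAIL_UTF8;
    if (strncmp(codeset, "ISO-8859-", 9) == 0)
        return TAIL_SINGLE_BYTE;
    return TAIL_GENERIC_MB;
}

/* Strategy for the current locale. The caller is expected to have
 * called setlocale(LC_CTYPE, "") beforehand. */
enum tail_strategy pick_strategy(void)
{
    return classify_codeset(nl_langinfo(CODESET));
}
```

Anything not recognized falls through to the slow generic path, so an
unknown encoding costs performance but not correctness.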

Sidenote: What would users be more interested in:

  a) print the last n characters?

  b) print the last n base characters plus their associated
     combining characters?

  c) print the last n column positions?
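
To show how the three counts can differ for the same text, here is a toy
sketch working on already-decoded code points. The classification
functions are deliberately simplistic assumptions: only U+0300..U+036F
is treated as combining, and only the CJK Unified Ideographs and Hangul
Syllables blocks as double-width. A real implementation would use proper
Unicode tables or wcwidth.

```c
#include <stddef.h>
#include <stdint.h>

struct text_counts {
    size_t chars;      /* (a) every code point */
    size_t base_chars; /* (b) code points that are not combining marks */
    size_t columns;    /* (c) terminal cells occupied */
};

/* Toy rule: Combining Diacritical Marks block only. */
static int toy_is_combining(uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

/* Toy rule: CJK Unified Ideographs and Hangul Syllables only. */
static int toy_is_wide(uint32_t cp)
{
    return (cp >= 0x4E00 && cp <= 0x9FFF) ||
           (cp >= 0xAC00 && cp <= 0xD7A3);
}

struct text_counts count_text(const uint32_t *cp, size_t n)
{
    struct text_counts c = {0, 0, 0};

    for (size_t i = 0; i < n; i++) {
        c.chars++;
        if (toy_is_combining(cp[i]))
            continue; /* attaches to the previous base, zero width */
        c.base_chars++;
        c.columns += toy_is_wide(cp[i]) ? 2 : 1;
    }
    return c;
}
```

For the sequence "e", U+0301, U+4E2D this gives 3 characters, 2 base
characters and 3 columns, so the three options really are three
different answers.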

Perhaps the last-n-bytes function is mostly used for binary files anyway,
and text files are usually treated on a per-line basis. Where an old
application counted bytes, it is not always clear whether we should turn
this functionality into

  - bytes (b)
  - characters (r)
  - base characters (c) 
  - columns (n)
  - all of these as alternatives

Perhaps suffixes have to be added in the case of head/tail to
distinguish these alternatives: 5k = 5000 bytes, 72n = 72 columns, 5mr =
5000000 characters, etc.
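
A parser for such suffixes might be sketched as follows. This is only an
illustration of the idea, not an existing head/tail option syntax: the
type and function names are invented, the multipliers are decimal as in
the examples above, a missing unit letter is assumed to mean bytes, and
overflow is not checked.

```c
#include <stdlib.h>

enum count_unit { UNIT_BYTES, UNIT_CHARS, UNIT_BASE_CHARS, UNIT_COLUMNS };

struct count_spec {
    unsigned long long n;
    enum count_unit unit;
};

/* Parse "<digits>[k|m][b|r|c|n]". Returns 0 on success, -1 on error. */
int parse_count(const char *s, struct count_spec *out)
{
    char *end;
    unsigned long long n;
    enum count_unit unit = UNIT_BYTES; /* assumed default */

    if (*s < '0' || *s > '9')
        return -1;
    n = strtoull(s, &end, 10);

    /* Optional decimal multiplier. */
    if (*end == 'k') {
        n *= 1000ULL;
        end++;
    } else if (*end == 'm') {
        n *= 1000000ULL;
        end++;
    }

    /* Optional unit letter, as in the list above. */
    switch (*end) {
    case 'b': unit = UNIT_BYTES;      end++; break;
    case 'r': unit = UNIT_CHARS;      end++; break;
    case 'c': unit = UNIT_BASE_CHARS; end++; break;
    case 'n': unit = UNIT_COLUMNS;    end++; break;
    default:  break;
    }
    if (*end != '\0')
        return -1; /* trailing garbage */

    out->n = n;
    out->unit = unit;
    return 0;
}
```

With this, "5k" yields 5000 bytes, "72n" yields 72 columns and "5mr"
yields 5000000 characters. The letters would of course have to be
reconciled with the suffixes that existing head/tail implementations
already accept.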

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/