Re: Additional multi-byte functions

Bruno Haible Mon, 14 May 2001 09:52:25 -0700
Markus Kuhn writes:
> > And of course, it should only look at the tail of the
> > file, especially if it's a very large file.
> 
> With the strictly-portable ISO C 99 mb* functions on the other hand,
> I can't see immediately any alternative to parsing the entire 5 GB
> file from the very beginning.

The life-saving trick is that you can identify newline characters, as
single bytes, independently of the encoding. So "tail" reads the file
from the end, going backwards, and parses line after line. Since lines
are typically short, it doesn't read many more bytes from disk than
are really needed.
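
A minimal sketch of such a backward scan, for illustration only (it is
not the actual patch). It assumes a seekable stream opened in binary
mode and a newline that is the single byte 0x0A in every supported
encoding. The function name tail_start_offset and the 4096-byte block
size are made up for this example.

```c
#include <stdio.h>

/* Hypothetical helper: return the byte offset at which the last n
   lines of fp begin, or -1 on error. The file is read in blocks from
   the end toward the beginning, so only the tail is touched. */
long tail_start_offset(FILE *fp, long n)
{
    char buf[4096];
    long end, found = 0;
    int c;

    if (n <= 0 || fseek(fp, 0L, SEEK_END) != 0)
        return -1;
    end = ftell(fp);
    if (end < 0)
        return -1;
    if (end == 0)
        return 0;                       /* empty file */

    /* A newline that terminates the final line does not start a new
       line, so leave it out of the search. */
    if (fseek(fp, end - 1, SEEK_SET) != 0)
        return -1;
    c = getc(fp);
    if (c == EOF)
        return -1;
    if (c == '\n')
        end--;

    while (end > 0) {
        long chunk = end < (long) sizeof buf ? end : (long) sizeof buf;
        long start = end - chunk;
        long i;

        if (fseek(fp, start, SEEK_SET) != 0
            || fread(buf, 1, (size_t) chunk, fp) != (size_t) chunk)
            return -1;

        /* Walk the block backwards, counting newline bytes. */
        for (i = chunk - 1; i >= 0; i--) {
            if (buf[i] == '\n' && ++found == n)
                return start + i + 1;   /* byte after the n-th newline */
        }
        end = start;
    }
    return 0;   /* fewer than n lines: the tail is the whole file */
}
```

Once the offset is known, only the bytes from there to the end of the
file need to go through the mb* functions.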

> I hope you tested in the end with nl_langinfo(CODESET) the name of the
> encoding and provided efficient special solutions for the "ISO-8859-*"
> and "UTF-8" cases ...

For ISO-8859-*, yes: when MB_CUR_MAX == 1, the program behaves the
traditional way. For UTF-8, no, because the gain from special-casing
it would not be large.
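
A rough sketch of that kind of dispatch, under stated assumptions: the
function name count_chars is invented here, and treating an invalid or
incomplete byte sequence as one character is an arbitrary choice made
for the example, not necessarily what the patched utilities do.

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: count the characters in the len bytes at s,
   according to the current locale's encoding. */
size_t count_chars(const char *s, size_t len)
{
    mbstate_t state;
    size_t count = 0;

    /* Single-byte locale (e.g. ISO-8859-*): one byte is one
       character, so the traditional code path applies. */
    if (MB_CUR_MAX == 1)
        return len;

    memset(&state, 0, sizeof state);
    while (len > 0) {
        size_t k = mbrlen(s, len, &state);

        if (k == (size_t) -1 || k == (size_t) -2) {
            /* Invalid or incomplete sequence: count one byte as one
               character and reset the state (assumption, see above). */
            k = 1;
            memset(&state, 0, sizeof state);
        } else if (k == 0) {
            k = 1;      /* a null byte counts as one character */
        }
        s += k;
        len -= k;
        count++;
    }
    return count;
}
```

The multibyte branch is generic: it works for UTF-8 as well as for any
other locale encoding, with no UTF-8-specific code.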

> Sidenote: What would users be more interested in:
> 
>   a) print the last n characters?
> 
>   b) print the last n base characters plus their associated
>      combining characters?
> 
>   c) print the last n column positions?

Good question. For "tail" I provide only a). Options b) and c) are
provided only for utilities where the aesthetics of the output are
traditionally important, for example "cut" and "fold".

> Perhaps suffixes have to be added in the case of head/tail to
> distinguish these alternatives: 5k = 5000 bytes, 72n = 72 columns, 5mr =
> 5000000 characters, etc.

Only the "head" and "tail" programs have such clumsy command-line
option conventions; others, like "wc", have more getopt()-like options.

Bruno
-
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/lists/