Re: NFS4 requires UTF-8

Glenn Maynard Thu, 21 Feb 2002 18:19:14 -0800

On Fri, Feb 22, 2002 at 12:55:31AM +0100, Pablo Saratxaga wrote:
> > OTOH, the unprinting character problem is important.  Would it be
> > reasonable to escape (\u) characters with wcwidth(c)==0 (in tool output,
> > ie ls -b), or is there some reasonable use of them in filenames?
> 
> There are reasonable use of zwj and zwnj and similar, they are needed
> for proper writing in some languages.
> 
> In fact, all the trouble comes from the xterm, not from "ls".


If a filename is a BOM followed by "hello", how can I enter it?  I
don't expect my terminal emulator to remember all control characters
sent at any cursor position and paste them along with other characters,
so I'd end up pasting "hello" alone.  It's worse when the filename is
*only* unprinting characters, and there's nothing on screen to copy at
all.  (That's just plain confusing, too.)

We can't blame the terminal for not being able to copy and paste
arbitrary sequences of bytes.  It's not ls's "fault" either, per se (it's
inherent), but that doesn't mean it can't help.

> I would say that ls should not escape them, only invalid utf-8 and
> control chars.
> 
> then, another command line switch should be added to "escape all but
> printable ascii".

Well, I'd like all nonprinting characters escaped, but not, say, 日本語.
That means I can copy and paste the filename, and characters that *can*
be copied and pasted aren't escaped.  (but see below)
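
A minimal sketch of that policy, in Python purely for illustration (it
is not how ls is implemented). The function name and the definition of
"nonprinting" are assumptions: here it means the Unicode "Other"
categories (Cc, Cf, Cs, Co, Cn) plus the line and paragraph separators.
Note that this rule also escapes zwj and zwnj, which may be too
aggressive for the languages that need them.

```python
import unicodedata


def escape_unprintable(name: str) -> str:
    """Escape code points that leave nothing on screen to copy.

    Hypothetical helper: printable text such as CJK is passed through,
    everything else becomes \\uXXXX or \\UXXXXXXXX.
    """
    out = []
    for ch in name:
        cat = unicodedata.category(ch)
        # Assumption: "nonprinting" = categories C* plus Zl and Zp.
        if cat.startswith("C") or cat in ("Zl", "Zp"):
            cp = ord(ch)
            out.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
        elif ch == "\\":
            # Double the backslash so the output stays unambiguous.
            out.append("\\\\")
        else:
            out.append(ch)
    return "".join(out)
```

With this rule a BOM followed by "hello" is listed as \uFEFFhello, so
the whole name is visible and can be copied, while 日本語 is left alone.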

> more complex options are not to be done in the command line on an xterm,
> a graphical toolkit is more suited for that.

It's acceptable to go from "able to type all filenames with the
keyboard" to "need to copy and paste filenames which I can't type
directly".  That's reasonable (if only because it's unavoidable).  (As
has been pointed out, it's already there in ISO-8859-1.)

It's not acceptable to have filenames that I can't access from a CLI
(with C+P) reliably at all (or that I need to switch to a special ls mode
that escapes *everything* over ASCII to access.)  Wildcards are a useful
fallback, but they don't stand alone--it still wouldn't help me target a
file consisting only of control characters, for example.  Telling me to
"use a GUI" is simply no good.  (I'm not installing X on a 486 running
FTP to delete a file someone dumped in my /incoming.)

Files are an extremely fundamental part of a Unix system, and all fundamental
parts of Unix are accessible from a CLI.  That's always been one of its
greatest strengths, and we can't throw that away for filenames.  This is
why GNU ls supports escaping.

> the reason is that with ls/xterm the rendering and the tool handling the
> filenames are dissociated, so you cannot easily do interesting things,

ls supports escaping that matches bash's (\ooo, \xHH, \n, etc.).  If
bash's syntax is extended to include \uXXXX and \UXXXXXXXX, then ls can
be extended to display control characters, etc. in that form
(optionally, for the sake of compatibility).

(I think that extension is useful, whether or not ls uses it.)
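
To make the shell side of that concrete, here is a rough sketch of
decoding the two proposed forms back into characters. It is written in
Python for illustration only; the function name is hypothetical and it
handles just \uXXXX, \UXXXXXXXX and a doubled backslash, not the
existing \ooo, \xHH or \n escapes.

```python
import re

# Matches \uXXXX, \UXXXXXXXX, or an escaped backslash.
_ESCAPE = re.compile(r"\\(?:u([0-9A-Fa-f]{4})|U([0-9A-Fa-f]{8})|(\\))")


def unescape_name(text: str) -> str:
    """Turn \\uXXXX and \\UXXXXXXXX escapes back into code points.

    Hypothetical helper; raises ValueError for values above U+10FFFF.
    """
    def repl(match):
        if match.group(3):
            return "\\"
        return chr(int(match.group(1) or match.group(2), 16))

    return _ESCAPE.sub(repl, text)
```

A name that was listed as \uFEFFhello can then be pasted back and
resolves to the original BOM-prefixed filename.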

Just because the tools aren't maintained by the same person doesn't mean
there can't be cooperation.  (Though, considering how difficult it's
proving to be to get UTF-8 support at all in bash, I don't expect *all*
shells to support this.)

This doesn't involve xterm (or any terminal) at all, just the shell and
tools.

> So, the only interesting change that would be worth doing for the
> use of utf-8 in filenames will be an extra switch to ls to quote
> everything but ascii, and ensure it quotes incorrect utf-8 when the
> locale is in utf-8 mode.

I disagree; I think it's interesting, useful and practical to escape
certain other cases.  Leading combining characters, probably, and any
characters not useful in filenames.  (Of course, it's not necessarily
easy to determine what's useful.  I don't see BIDI support in filenames
as useful--that seems to be a property of whatever text is displaying
the filenames, not the filenames themselves--but I'm not a BIDI user, so
I can only guess.)
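
A sketch of two of these cases, invalid UTF-8 and a leading combining
character, again in Python for illustration. The function name is
hypothetical, and treating every general category M* code point as
"combining" is an assumption. Which other characters are "not useful in
filenames" is left open, as it is above.

```python
import unicodedata


def escape_raw_name(raw: bytes) -> str:
    """Escape invalid UTF-8 bytes and a leading combining mark."""
    # surrogateescape maps each undecodable byte 0xNN to U+DCNN,
    # so bad bytes can be picked out after decoding.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Invalid byte: show it in the existing \xHH form.
            out.append(f"\\x{cp - 0xDC00:02X}")
        elif i == 0 and unicodedata.category(ch).startswith("M"):
            # Leading combining mark has no base character to attach to.
            out.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
        elif ch == "\\":
            out.append("\\\\")
        else:
            out.append(ch)
    return "".join(out)
```

A combining mark that follows a base character is left untouched, so
ordinary accented names are not affected.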

I'm unclear on how control characters that change state behave in
filenames at all.  To pick a simple example, what if a filename contains
the language code "zh"?  I can no longer write a simple C program that
outputs "The first file is %s.  The second file is %s. [...]", because
the text after the first %s is marked Chinese.  (This probably won't break
anything, but other control characters probably would.)  Invalidate all
state after outputting a filename?  Complicated.  (I don't know what zwj
and zwnj do; perhaps a more practical example could be made with them.)
Anyone feel like filling me in here?

This would be like embedding ANSI color sequences in filenames and ls
letting them through: the color would bleed onto the next line unless ls
knew to reset the color after each filename.
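
For the color case the reset is at least simple to sketch (Python, for
illustration; the function name is hypothetical). It only resets SGR
attributes such as color, which is exactly the limitation: other
state-changing characters would each need their own reset.

```python
# ESC [ 0 m resets all SGR attributes (color, bold, etc.).
SGR_RESET = "\x1b[0m"


def format_listing(names):
    """Emit one name per line, resetting color after each name."""
    return "".join(f"{name}{SGR_RESET}\n" for name in names)
```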

-- 
Glenn Maynard
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
