Followup to:  <[EMAIL PROTECTED]>
By author:    Pablo Saratxaga <[EMAIL PROTECTED]>
In newsgroup: linux.utf8
> 
> > There is a fairly important point here: POSIX requires that a filename
> > can consist of any sequence of bytes other than '/' and '\0'.  Valid
> 
> I read this as "the system must be able to manipulate them", not as
> "use of any arbitrary and meaningless flow of bytes is encouraged".
> 

Right, of course.

> > UTF-8 sequences are a subset of this.  This presumably means that
> > illegal UTF-8 sequences should be accepted by the server as valid
> > filenames;
> 
> yes.
> 
> > This *DOES*, however, have implications for applications that use
> > filenames, such as the shell or ls.  Traditionally, nonreadable
> > filenames have been displayed using "escape codes".  With UTF-8, there
> > are now two levels of "nonreadableness":
> > 
> > a) Those that don't correspond to any valid UTF-8 sequences.
> >    \xc2\x7f would be such a sequence.
> > 
> > b) Those that don't correspond to a displayable Unicode.
> >    \xf3\xa1\x88\xb4 a.k.a. \U000E1234 would be such a sequence.
> > 
> > Since it is very important that the shell can access any file that can
> > exist in the system, I believe there should be a standard (formal or
> > informal) proposed for how to display these escape codes.
> 
> Imho it just has to be done the same way as it currently is.
> 
> Currently you can have a filename with bytes in 0x01-0x1F and 0x7F-0x9F,
> however you cannot usually type those directly.
> Well, you can use those \x88 and the like representations, or use
> that lovely tab-completion feature (if the filename starts with
> a typable thing), or use a tool that allows you to pick the
> file in a menu (that is my preferred way to delete "bizarre" file names:
> select them in "mc" and press F8; it is much easier)
> 

The main difference is that with UTF-8 there are byte sequences that
don't decode to any wide character at all.  Valid UTF-8 sequences, on
the other hand, I would like to see decoded; after all, it might simply
be that I don't have the font to display them, or that they belong to a
newer version of Unicode than I have on my system.

A fairly easy way to deal with this is to set up your
multibyte->widechar converter to give you a recognizable invalid
widechar for each byte of invalid UTF-8.  For example, use 0x7fffffxx
for this purpose, where xx is the offending byte.  Then the byte
sequence:

    37 cc 93 ca e0 9f bf df bf

... could be converted to ...

    00000037 00000313 7fffffca 7fffffe0 7fffff9f 7fffffbf 000007ff

... and displayed as ...

    ,
    7\xca\xe0\x9f\xbf\u07ff
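
As a rough sketch of the scheme above (Python, not any particular
libc's converter): the names seq_len, decode_lenient and escape are
made up for illustration, and the "displayable" test is an assumption
that depends on the Unicode tables and fonts of the system at hand.
It leans on Python's strict UTF-8 decoder, which follows the current
definition (at most 4 bytes, no overlong forms, no surrogates).

```python
INVALID_BASE = 0x7FFFFF00  # marker range, following the 0x7fffffxx suggestion


def seq_len(lead):
    """Expected sequence length for a UTF-8 lead byte, 0 if it cannot lead."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # stray continuation byte or a byte never valid in UTF-8


def decode_lenient(data):
    """Decode bytes to code points; each invalid byte becomes 0x7fffffxx."""
    out = []
    i = 0
    while i < len(data):
        n = seq_len(data[i])
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("bad lead byte or truncated sequence")
            # The strict decoder rejects overlong forms and bad continuations.
            out.append(ord(chunk.decode("utf-8")))
            i += n
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            out.append(INVALID_BASE | data[i])
            i += 1  # resynchronize on the very next byte
    return out


def escape(codes, displayable=str.isprintable):
    """Render code points, escaping invalid bytes and undisplayable characters."""
    parts = []
    for c in codes:
        if c & ~0xFF == INVALID_BASE:
            parts.append("\\x%02x" % (c & 0xFF))   # byte of invalid UTF-8
        elif displayable(chr(c)):
            parts.append(chr(c))                   # show it as is
        elif c <= 0xFFFF:
            parts.append("\\u%04x" % c)            # valid but not displayable
        else:
            parts.append("\\U%08x" % c)
    return "".join(parts)


codes = decode_lenient(bytes.fromhex("37cc93cae09fbfdfbf"))
print(" ".join("%08x" % c for c in codes))
print(escape(codes))
```

The first print gives the widechar list shown above.  What the second
one prints for the final character depends on whether the local Unicode
tables consider U+07FF displayable, which is exactly the "newer version
of Unicode" case.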

        -hpa

-- 
<[EMAIL PROTECTED]> at work, <[EMAIL PROTECTED]> in private!
"Unix gives you enough rope to shoot yourself in the foot."
http://www.zytor.com/~hpa/puzzle.txt    <[EMAIL PROTECTED]>
--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
