On Wed, Feb 03, 2016 at 04:46:57PM +0100, Ansgar Burchardt wrote:
> I'm fairly sure POSIX requires that almost all garbage can be part of
> filenames[1].  Also, userland still doesn't default to UTF-8 when no
> LC_* variable is set.  This is what "ls" does then:
> 
> $ LC_ALL=C ls ~/Music
> ??????????????????????????????????????????????????????
> ?????????????????? ??????????????????????????????vs???????????????
> ???????????? Original Sound Track
> ...

Which is the closest you can get to the desired output without a
transliteration table (such as
https://github.com/kilobyte/kbtin/blob/master/translit.h) [1].
The C locale simply has no means to display such characters.
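
As a rough illustration of why the output above is all question marks,
here is a minimal Python sketch. It is not what ls actually runs, only an
approximation that assumes every byte outside printable ASCII is
undisplayable in the C locale; the function name c_locale_display is made
up for this example.

```python
def c_locale_display(name: bytes) -> str:
    """Approximate how a C-locale tool might show a filename:
    keep printable ASCII (0x20-0x7E), replace every other byte with '?'."""
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "?" for b in name)


if __name__ == "__main__":
    # Two CJK characters take three bytes each in UTF-8, so they turn
    # into six question marks; the ASCII part survives unchanged.
    print(c_locale_display("音楽 Original Sound Track".encode("utf-8")))
```

Note that one character becomes several question marks, because the C
locale sees only individual bytes.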

> (Which is pretty much the same as "ls" on non-UTF-8 filenames in an
> UTF-8 locale I mentioned in an earlier mail.)

Such characters are an encoding error, and thus ls can reasonably consider
them to be garbage.
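
To make "encoding error" concrete, a small Python sketch of the check a
tool in a UTF-8 locale could apply to a raw filename. The helper name
is_valid_utf8 is hypothetical, and this only shows the idea, not how ls
implements it.

```python
def is_valid_utf8(name: bytes) -> bool:
    """Return True if the raw filename bytes decode cleanly as UTF-8."""
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(is_valid_utf8("é".encode("utf-8")))    # UTF-8 encoded name
    print(is_valid_utf8("é".encode("latin-1")))  # same name in ISO-8859-1
```

A name stored in ISO-8859-1 fails this check, and nothing in the bytes
themselves says which encoding was intended.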

>       At least a "filename" is defined as a byte string consisting of 
>       anything except \0 and /:
>       http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_170
> 
>       There's also a "portable filename character set", but that is
>       only [A-Za-z0-9_.-] and mentioned in the definition of
>       "pathname": if only characters from the portable set are used
>       in the filename, the name is usable in all locales as a character
>       string, otherwise it's just a string.

Which means that implementations must accept at least the portable set, and
are free to decide whether to accept anything else, with the exception of
\0 and /.
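
The two definitions quoted above can be sketched in Python as follows.
The function names are invented for this example, and the checks cover
only the rules mentioned in this thread (no \0, no /, and the
[A-Za-z0-9_.-] set), nothing else from the standard.

```python
import string

# The "portable filename character set" as quoted: [A-Za-z0-9_.-]
PORTABLE = frozenset((string.ascii_letters + string.digits + "._-").encode("ascii"))


def is_filename(name: bytes) -> bool:
    """Any non-empty byte string without NUL or slash."""
    return len(name) > 0 and b"\0" not in name and b"/" not in name


def is_portable_filename(name: bytes) -> bool:
    """Non-empty and built only from the portable character set."""
    return len(name) > 0 and all(b in PORTABLE for b in name)


if __name__ == "__main__":
    for raw in (b"README.txt", "音楽".encode("utf-8"), b"a/b"):
        print(raw, is_filename(raw), is_portable_filename(raw))
```

Every portable filename is a filename, but not the other way around: the
UTF-8 name passes the first check and fails the second.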



[1]. Assuming you can guess the actual encoding, which you have no real way
of doing.
-- 
A tit a day keeps the vet away.
