On Wed, Feb 03, 2016 at 04:46:57PM +0100, Ansgar Burchardt wrote:
> I'm fairly sure POSIX requires that almost all garbage can be part of
> filenames[1]. Also, userland still doesn't default to UTF-8 when no
> LC_* variable is set. This is what "ls" does then:
>
> $ LC_ALL=C ls ~/Music
> ??????????????????????????????????????????????????????
> ?????????????????? ??????????????????????????????vs???????????????
> ???????????? Original Sound Track
> ...

Which is the closest you can get to the desired output without a
transliteration table (such as
https://github.com/kilobyte/kbtin/blob/master/translit.h) [1]. The C locale
simply has no means to display such characters.

> (Which is pretty much the same as "ls" on non-UTF-8 filenames in an
> UTF-8 locale I mentioned in an earlier mail.)

Such characters are an encoding error, and thus ls can reasonably consider
them to be garbage.

> At least a "filename" is defined as a byte string consisting of
> anything except \0 and /:
> http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_170
>
> There's also a "portable filename character set", but that is
> only [A-Za-z0-9_.-] and mentioned in the definition of
> "pathname": if only characters from the portable set are used
> in the filename, the name is usable in all locales as a character
> string, otherwise it's just a string.

Which means that implementations must accept at least the portable set, and
are free to decide whether to accept anything else, with the exception of \0
and /.

[1]. Assuming you can guess the actual encoding, which you have no real way
to.
-- 
A tit a day keeps the vet away.
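
For illustration only, here is a minimal Python sketch of the two points
discussed above: checking a name against the portable filename character set
[A-Za-z0-9_.-], and roughly how a name looks once every byte the C locale
cannot display is replaced by '?'. The function names is_portable and
show_in_c_locale are made up for this example, and the '?' substitution is
only an approximation of what ls does on a terminal, not its actual code.

```python
import re

# POSIX portable filename character set: A-Z a-z 0-9 . _ -
_PORTABLE = re.compile(rb"[A-Za-z0-9._-]+")


def is_portable(name: bytes) -> bool:
    """Return True if every byte of name is in the portable set.

    An empty name is rejected because it is not a valid filename.
    """
    return _PORTABLE.fullmatch(name) is not None


def show_in_c_locale(name: bytes) -> str:
    """Approximate the display of name in the C locale.

    Printable ASCII bytes (0x20..0x7E) are kept as they are; every
    other byte, including each byte of a multi-byte UTF-8 sequence,
    is shown as '?'.
    """
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "?" for b in name)


if __name__ == "__main__":
    for raw in (b"Original_Sound-Track.ogg", "Caf\u00e9 vs Th\u00e9".encode("utf-8")):
        print(is_portable(raw), show_in_c_locale(raw))
```

Names are handled as bytes on purpose: as noted above, a filename is a byte
string, and it only becomes a character string once an encoding is assumed.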

