On Windows NTFS (and the LFN extensions of FAT32 and exFAT) at least, arbitrary sequences of 16-bit code units are not permitted. There is visibly a validation step that returns an error if you attempt to create files with invalid sequences (it also enforces other restrictions, such as forbidding U+0000 and some other problematic control characters).
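For illustration, here is a minimal Python sketch of this kind of validation. It mirrors the documented Win32-level rules rather than the actual driver code, and the forbidden-character set shown is an approximation:

```python
# Sketch of Win32-style filename validation (not the actual NTFS driver
# logic): reject forbidden characters, control characters, and unpaired
# UTF-16 surrogates.
FORBIDDEN = set('\x00<>:"/\\|?*') | {chr(c) for c in range(1, 32)}

def is_valid_windows_name(name: str) -> bool:
    if not name:
        return False
    if any(ch in FORBIDDEN for ch in name):
        return False
    # An unpaired UTF-16 surrogate shows up in Python as a lone code
    # point in the range U+D800..U+DFFF.
    if any(0xD800 <= ord(ch) <= 0xDFFF for ch in name):
        return False
    return True
```

The real check happens inside the kernel when the name is created, so user code normally only sees the resulting error status.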
This occurs because the NTFS and FAT drivers also attempt to normalize the string in order to create compatibility 8.3 filenames using the system's native locale (not the current user locale, which is used when searching files, enumerating directories, or opening files). This could generate errors when the encodings for distinct locales do not match, but should not cause errors as long as filenames are **first** searched in the UTF-16 encoding specified by applications; applications that still need to access files by their short name are deprecated. The normalization used for creating short 8.3 filenames relies on OS-specific conversion tables built into the filesystem drivers.

This generation has a cost, however, due to the uniqueness constraint: the first part of the 8.3 name must be abbreviated to append a "~number" suffix before the extension, whose value is unpredictable if other "*~1.*" files already exist — the driver must retry with another number, looping if necessary. There is also a (very modest) storage cost, but that is less critical than the enumeration step and the fact that these shortened names cannot be predicted by applications. This canonicalization is also required because the filesystem is case-insensitive (and it is technically not possible to store all the case variants of a filename as assigned aliases/physical links).

In classic filesystems for Unix/Linux, the only restrictions are the forbidden null byte and the role assigned to "/" in hierarchical filesystems (unusable anywhere in a directory entry name), plus the reservation of the "." and ".." entries in directories. This means that only 8-bit encodings based on 7-bit ASCII are usable in practice, so Linux/Unix does not treat filenames as purely binary bags of bytes either. However, nothing more is checked, so arbitrary byte sequences can still occur as names, and they are difficult to handle with classic tools and shells.
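As an illustration of that retry loop, here is a deliberately simplified Python sketch. The real drivers use OEM-codepage conversion tables and more filtering rules than this, and only fall back to "~N" names when the long name does not already fit the 8.3 form:

```python
import re

def short_name(long_name: str, existing: set) -> str:
    """Simplified 8.3 short-name generation with "~N" uniqueness suffixes."""
    # Split off the last extension.
    stem, dot, ext = long_name.rpartition('.')
    if not dot:
        stem, ext = long_name, ''
    # Uppercase and drop characters not representable in a short name
    # (a crude stand-in for the driver's codepage conversion tables).
    clean = lambda s: re.sub(r'[^A-Z0-9_~-]', '', s.upper())
    stem, ext = clean(stem), clean(ext)[:3]
    suffix = '.' + ext if ext else ''
    # Try STEM~1, STEM~2, ... until an unused name is found -- this
    # retry loop is the unpredictable, potentially costly part.
    n = 1
    while True:
        tail = '~%d' % n
        cand = stem[:8 - len(tail)] + tail + suffix
        if cand not in existing:
            return cand
        n += 1
```

Note how the resulting "~2", "~3", ... values depend entirely on which other names already exist in the directory, which is why applications cannot predict them.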
Some other filesystems for Linux/Unix still enforce restrictions, and some versions of them even support case-insensitivity, in addition to the emulated FAT12/FAT16/FAT32/exFAT/NTFS filesystems: this also exists as an option in the NFS driver, in drivers for legacy filesystems originally coming from mainframes, in filesystem drivers based on FTP, and even in the filesystem driver that allows mounting a Windows registry (which is also case-insensitive).

Technically, the core Linux/Unix kernel places no restriction on the effective encoding (except "/" and null); the actual restrictions are implemented within filesystem drivers and configured only when volumes are mounted. Each mounted filesystem can then have its own internal encoding, and you will see different behaviors when using a driver for any MacOS filesystem. Linux can work perfectly well with NTFS filesystems, except that most of the time short filenames are completely ignored and not generated on the fly. Generating short filenames in a legacy (unspecified) 8-bit codepage is not a requirement of NTFS, and it can be disabled in Windows as well.
But FAT12/FAT16/FAT32 still require these legacy short names to be generated. If only the LFN were used and the short 8.3 name were left completely null in the main directory entry, legacy FAT drivers would choke on these null entries — unless they were tagged by a custom attribute bit as "ignorable but not empty", or unless the 8+3 characters used specific unique patterns such as "\" followed by 7 pseudo-random characters in the main part, plus 3 other pseudo-random characters in the extension. These 10 characters may use any non-null value: they provide nearly 80 bits, or more exactly 250^10 identifiers if we exclude the 6 reserved values "/", "\", ".", ":", NULL and SPACE. They could be generated almost predictably simply by hashing the original unabbreviated name (e.g. truncating a cryptographic hash such as SHA-1 to 79 bits, or faster with simple MD5 hashing), with very rare remaining collisions to handle.

Some FAT repair tools will attempt to repair legacy short filenames that are not unique or cannot be derived from the UTF-16-encoded LFN (this happens when "repairing" a FAT volume initially created on another system that used a different 8-bit OEM codepage). But these "CheckDisk" tools should have an option not to "repair" them, given that modern applications normally do not need these filenames when a LFN is present (even the Windows Explorer will not display the short names: they are hidden by default whenever a LFN overrides them).
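The hash-based scheme described above can be sketched like this. MD5 and the UTF-16-LE input encoding are illustrative choices for a deterministic example, not what any actual driver does:

```python
import hashlib

# The 250 usable byte values: everything except the 6 reserved ones
# (NUL, '/', '\', '.', ':' and SPACE) mentioned above.
RESERVED = {0x00, 0x2F, 0x5C, 0x2E, 0x3A, 0x20}
ALPHABET = bytes(b for b in range(256) if b not in RESERVED)  # 250 values

def hashed_short_name(long_name: str):
    """Derive the 10 pseudo-random characters of a '\\'-tagged 8+3 entry
    deterministically from the long name (250**10 > 2**79 identifiers)."""
    h = int.from_bytes(hashlib.md5(long_name.encode('utf-16-le')).digest(), 'big')
    chars = bytearray()
    for _ in range(10):
        h, r = divmod(h, len(ALPHABET))
        chars.append(ALPHABET[r])
    # 8-byte main part starting with '\', plus a 3-byte extension.
    return b'\\' + bytes(chars[:7]), bytes(chars[7:])
```

Because the name is derived from a hash rather than a "~N" counter, no directory enumeration is needed to pick it, and only the (very rare) collisions would require a retry.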
We must add, however, that on FAT filesystems a LFN will not always be stored: it is omitted if the Unicode name already has the "8.3" form and all its characters are ASCII (which is the common base of all supported 8-bit OEM charsets). But a LFN will be created if the user edits the filename to use a preferred capitalization other than the default one (the Explorer default is to render fully-capitalized short filenames with a single leading capital letter and all other characters, including the 1-to-3-character file extension, displayed as lowercase; so the "Windows" LFN is stored simply as the "WINDOWS" short filename, without any LFN needed in the directory entries).

To be complete, a few legacy names are also reserved and can't be used as Windows (short or LFN) filenames, such as "CON" (case-insensitive): they are claimed by a legacy non-filesystem driver before names are looked up in a specific current directory. To use them as filenames, you must prefix them with a drive letter, with the ".\" prefix (relative to the current directory), or with a full path name.

2017-05-16 17:44 GMT+02:00 Hans Åberg <haber...@telia.com>:

> > On 16 May 2017, at 17:30, Alastair Houghton via Unicode <unicode@unicode.org> wrote:
> >
> > On 16 May 2017, at 14:23, Hans Åberg via Unicode <unicode@unicode.org> wrote:
> >>
> >> You don't. You have a filename, which is a octet sequence of unknown encoding, and want to deal with it. Therefore, valid Unicode transformations of the filename may result in that is is not being reachable.
> >>
> >> It only matters that the correct octet sequence is handed back to the filesystem. All current filsystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above.
> >
> > HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. ...
>
> The filesystem directory is using octet sequences and does not bother passing over an encoding, I am told.
> Someone could remember one that used UTF-16 directly, but I think it may not be current.
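As a footnote to the reserved device names mentioned earlier ("CON" etc.), here is a minimal sketch of that check. The list is the classic Win32 set and may not be exhaustive or accurate for every Windows version:

```python
# Classic Win32 reserved device names (illustrative, not exhaustive).
# The check is case-insensitive and ignores any extension, so a name
# like "con.txt" is reserved too.
RESERVED_DEVICES = {'CON', 'PRN', 'AUX', 'NUL'} \
    | {'COM%d' % i for i in range(1, 10)} \
    | {'LPT%d' % i for i in range(1, 10)}

def is_reserved_device(name: str) -> bool:
    base = name.split('.')[0].strip().upper()
    return base in RESERVED_DEVICES
```

Only the bare base name matters, which is why adding a drive letter, a ".\" prefix, or a full path lets such names bypass the legacy device lookup.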