On Sun, Mar 26, 2017 at 6:57 PM, Chris Angelico <ros...@gmail.com> wrote:
>
> In actual UCS-2, surrogates are entirely disallowed; in UTF-16, they *must* be
> correctly paired.
Strictly speaking, UCS-2 disallows codes that aren't defined by the standard,
but the kernel couldn't be that restrictive. Unicode was a moving target in
the period that NT was developed (1988-93). The object manager simply allows
any 16-bit code in object names, except its path separator, backslash. Since a
UNICODE_STRING is counted, even NUL is allowed in object names. But that's
uncommon and should be avoided since the user-mode API uses null-terminated
strings.

The file-system runtime library further restricts this by reserving NUL, ASCII
control codes, forward slash, pipe, and the wildcard characters asterisk,
question mark, double quote, less than, and greater than. The rules are
loosened for NTFS named streams, which only reserve NUL, forward slash, and
backslash.

>> Windows file systems are also UCS-2. For the most part it's not an
>> issue since the source of text and filenames will be valid UTF-16.
>
> I'm actually not sure on that one. Poking around on both Stack
> Overflow and MSDN suggests that NTFS does actually use UTF-16, which
> implies that lone surrogates should be errors, but I haven't proven
> this. In any case, file system encoding is relatively immaterial; it's
> file system *API* encoding that matters, and that means the
> CreateFileW function and its friends:

Sure, the file system itself can use any encoding, but Microsoft uses a
permissive UCS-2 in its file systems. The API uses 16-bit WCHARs, and except
for a relatively small set of codes (assuming the file system uses the FsRtl),
the system generally doesn't care about the values.

Let's review the major actors. CreateFile uses the runtime library in
ntdll.dll to fill in an OBJECT_ATTRIBUTES [1] with a UNICODE_STRING [2]. This
is where the current-directory handle is set as the attributes' RootDirectory
handle for relative paths; where slash is replaced with backslash; and where
weird MS-DOS rules are applied, such as DOS device names and trimming trailing
spaces.
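Before following the call into the kernel, here's a rough Python sketch of the naming rules described above. It's only an illustration built from the reserved characters listed in this message, not the actual FsRtl implementation. The names is_allowed_component and has_lone_surrogate are made up for the example, and including backslash in the regular-file set is my assumption, since it's the path separator and can't appear inside a single component.

```python
# Reserved characters for ordinary file names, taken from the list in this
# message: NUL, ASCII control codes, forward slash, pipe, and the wildcard
# characters * ? " < >. Backslash is added as an assumption because it is
# the path separator.
FSRTL_RESERVED = frozenset(map(chr, range(32))) | frozenset('/|*?"<>\\')

# NTFS named streams, per this message, reserve only NUL, slash, backslash.
STREAM_RESERVED = frozenset('\0/\\')


def is_allowed_component(name, stream=False):
    """Return True if a single path component avoids the reserved set."""
    reserved = STREAM_RESERVED if stream else FSRTL_RESERVED
    return bool(name) and not any(ch in reserved for ch in name)


def has_lone_surrogate(name):
    """Return True if name can't be encoded as strictly valid UTF-16."""
    try:
        name.encode('utf-16-le')
    except UnicodeEncodeError:
        return True
    return False


if __name__ == '__main__':
    for candidate in ('spam.txt', 'spam?.txt', 'a|b', '\udc00'):
        print(ascii(candidate),
              is_allowed_component(candidate),
              is_allowed_component(candidate, stream=True),
              has_lone_surrogate(candidate))
```

Note that a lone surrogate such as '\udc00' is rejected by Python's strict UTF-16 codec, but it isn't in either reserved set, which is the "permissive UCS-2" behavior in a nutshell. Encoding it with the 'surrogatepass' error handler yields the bare 16-bit code.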
Once it has a native object attributes record, it calls the real system call
NtCreateFile [3]. In kernel mode this in turn calls the I/O manager function
IoCreateFile [4], which creates an open packet and calls the object manager
function ObOpenObjectByName.

Now it's time for path parsing. In the normal case the system traverses
several object directories and object symbolic links before finally arriving
at an I/O device (e.g. \??\C: => \Global??\C: => \Device\HarddiskVolume2).
Parsing the rest of the path is in the hands of the I/O manager via the Device
object's ParseProcedure.

The I/O manager creates a File object and an I/O request packet (IRP) for the
major function IRP_MJ_CREATE [5] and calls the driver for the device stack via
IoCallDriver [6]. If the device is a volume that's managed by a file-system
driver (e.g. ntfs.sys), the file-system driver parses the remaining path to
open or create the directory/file/stream and complete the IRP. The object
manager creates a handle for the File object in the handle table of the
calling process, and this handle value is finally passed back to the caller.

[1]: https://msdn.microsoft.com/en-us/library/ff557749
[2]: https://msdn.microsoft.com/en-us/library/ff564879
[3]: https://msdn.microsoft.com/en-us/library/ff566424
[4]: https://msdn.microsoft.com/en-us/library/ff548418
[5]: https://msdn.microsoft.com/en-us/library/ff548630
[6]: https://msdn.microsoft.com/en-us/library/ff548336

The object manager only cares about its path separator, backslash, until it
arrives at an object type that it doesn't manage, such as a Device object. If
a file system uses the FsRtl, then the remaining path is subject to Windows
file-system rules. It would be ill-advised to diverge from these rules.

> I *think* it's the naive (and very common) hybrid of UCS-2 and UTF-16

It's just the way the system evolved over time. UTF-16 wasn't standardized
until 1996, circa NT 4.0.
Windows started integrating it around NT 5 (Windows 2000), primarily for the
GUI controls in the windowing system that directly affect text processing for
most applications. It was good enough to leave most of the lower layers of the
system passively naive when it comes to UTF-16.
-- 
https://mail.python.org/mailman/listinfo/python-list