New submission from Barney Gale <barney.g...@gmail.com>:

Capturing a write-up by eryksun on GitHub into a new bug.

Link: https://github.com/python/cpython/pull/25264#pullrequestreview-631787754

> `nt._getfinalpathname()` opens a handle to a file/directory with 
> `CreateFileW()` and calls `GetFinalPathNameByHandleW()`. The latter makes a 
> few system calls to get the final opened path in the filesystem (e.g. 
> "\Windows\explorer.exe") and the canonical DOS name of the volume device on 
> which the filesystem is mounted (e.g. "\Device\HarddiskVolume2" -> "\\?\C:") 
> in order to return a canonical DOS path (e.g. "\\?\C:\Windows\explorer.exe").
> 
> Opening a handle with `CreateFileW()` entails first getting a fully-qualified 
> and normalized NT path, which, among other things, entails resolving ".." 
> components naively in the path string. This does not take reparse points such 
> as symlinks and mountpoints into account. The only time Windows parses ".." 
> components in an opened path the way POSIX does is in the kernel when they're 
> in the target path of a relative symlink.
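
The string-only handling of ".." described above can be illustrated with `ntpath.normpath()`, used here only as a stand-in for the normalization that happens before the path is opened. It never touches the filesystem, so it collapses `symlink\..` without checking whether `symlink` is a link. The path is a made-up example.

```python
import ntpath

# Hypothetical path in which "symlink" is assumed to be a filesystem symlink.
path = r"C:\foo\symlink\..\bar"

# normpath() only edits the string: "symlink\.." is dropped as a pair, so the
# link target is never consulted.
collapsed = ntpath.normpath(path)
print(collapsed)  # C:\foo\bar
```

If `symlink` pointed somewhere outside `C:\foo`, the POSIX answer would be a sibling of the link's target, not `C:\foo\bar`.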
> 
> `nt.readlink()` opens a handle to the file with the flag 
> `FILE_FLAG_OPEN_REPARSE_POINT`. If the final path component is a reparse 
> point, it opens it instead of traversing it. Then the reparse point is read 
> with the filesystem control request, `FSCTL_GET_REPARSE_POINT`. System 
> symlinks and mountpoints (`IO_REPARSE_TAG_SYMLINK` and 
> `IO_REPARSE_TAG_MOUNT_POINT`) are the only supported name-surrogate 
> reparse-point types, though `os.stat()` and `os.lstat()` handle all 
> name-surrogate types as 'links'. Moreover, only symlinks get the `S_IFLNK` 
> mode flag in a stat result, because they're the only ones we can create with 
> `os.symlink()` to satisfy the usage `if os.path.islink(src): 
> os.symlink(os.readlink(src), dst)`.
>
> > What would it take to do a POSIX-style "normalize as we resolve",
> > and would we want to? I guess we'd need to call nt._getfinalpathname()
> > on each path component in turn (C:, C:\Users, C:\Users\Barney etc),
> > which from my pretty basic Windows knowledge might be rather slow if
> > that involves file handles.
> 
> You asked, so I decided to write up an outline of what implementing a 
> POSIX-style `realpath()` might look like in Windows. At its core, it's 
> similar to POSIX: lstat(), and, for a symlink, readlink() and recur. The 
> equivalent calls in Windows are the following:
> 
>     * `CreateFileW()` (open a handle)
> 
>     * `GetFileInformationByHandleEx()`: `FileAttributeTagInfo`
> 
>     * `DeviceIoControl()`: `FSCTL_GET_REPARSE_POINT`
> 
> 
> A symlink has the reparse tag `IO_REPARSE_TAG_SYMLINK`.
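
A minimal sketch of the "lstat, readlink, recur" idea, assuming a simulated link table in place of the real Windows calls. The `resolve()` function and the `links` dict are hypothetical names for illustration: a lookup in `links` stands in for opening the component, checking its reparse tag, and reading the symlink target. Case-insensitivity, drive-relative targets, mountpoints, and the root ".." failure rule are all left out.

```python
import ntpath

def resolve(path, links, max_links=40):
    """Resolve `path` component by component against a simulated link table.

    `links` maps an absolute path to its symlink target; any path that is
    not in the table is treated as a regular file or directory.
    """
    drive, rest = ntpath.splitdrive(path)
    resolved = drive + "\\"
    pending = [p for p in rest.replace("/", "\\").split("\\") if p]
    followed = 0
    while pending:
        part = pending.pop(0)
        if part == ".":
            continue
        if part == "..":
            # Applied to the already-resolved prefix, so a symlink is never
            # skipped. dirname() leaves a root unchanged in this sketch.
            resolved = ntpath.dirname(resolved)
            continue
        candidate = ntpath.join(resolved, part)
        target = links.get(candidate)
        if target is None:
            resolved = candidate
            continue
        followed += 1
        if followed > max_links:
            raise OSError("too many levels of symbolic links")
        t_drive, t_rest = ntpath.splitdrive(target.replace("/", "\\"))
        if t_drive:
            resolved = t_drive + "\\"      # absolute target: restart at its root
        elif t_rest.startswith("\\"):
            # rooted target: restart at the root of the current device
            resolved = ntpath.splitdrive(resolved)[0] + "\\"
        # relative target: keep `resolved`, the directory holding the link
        pending = [p for p in t_rest.split("\\") if p] + pending
    return resolved

links = {r"C:\foo\symlink": r"C:\target\sub"}
result = resolve(r"C:\foo\symlink\..\bar", links)
print(result)  # C:\target\bar
```

Because ".." is applied only after `symlink` has been followed, the result is `C:\target\bar` rather than the `C:\foo\bar` that string-only normalization would give.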
> 
> Filesystem mountpoints (aka junctions, which are like Unix bind mountpoints) 
> must be retained in the resolved path in order to correctly resolve relative 
> symlinks such as "\spam" (relative to the resolved device) and "..\..\spam". 
> Anyway, this is consistent with the UNC case, since mountpoints on a remote 
> server can never be resolved (i.e. a final UNC path never resolves 
> mountpoints).
> 
> Here are some of the notable differences compared to POSIX:
> 
>     * If the source path is not a "\\?\" verbatim path, `GetFullPathNameW()` 
> must be called initially.  However, ".." components in the target path of a 
> relative symlink must be resolved the POSIX way, else symlinks in the target 
> path may be removed incorrectly before their target is resolved (e.g. 
> "foo\symlink\..\bar" incorrectly resolved as "foo\bar"). The opened path is 
> initially normalized as follows:
>       
>       * replace forward slashes with backslashes
>       * collapse repeated backslashes (except the UNC root must have exactly 
> two backslashes)
>       * resolve a relative path (e.g. "spam"), drive-relative path (e.g. 
> "Z:spam"), or rooted path (e.g. "\spam") as a fully-qualified path (e.g. 
> "Z:\eggs\spam")
>       * resolve "." and ".." components in the opened path (naive to symlinks)
>       * strip trailing spaces and dots from the final component (e.g. 
> "C:\spam. . ." -> "C:\spam")
>       * resolve reserved device names in the final component of a non-UNC 
> path (e.g. "C:\nul" -> "\\.\nul")
> 
>     * Substitute drives (e.g. created by "subst.exe", or `DefineDosDeviceW`) 
> and mapped drives (e.g. created by "net.exe", or `WNetAddConnection2W`) must 
> be resolved, respectively via `QueryDosDeviceW()` and 
> `WNetGetUniversalNameW()`. Like all DOS 'devices', these drives are 
> implemented as object symlinks (i.e. symlinks in the object namespace, not to 
> be confused with filesystem symlinks). The target path of these drives, 
> however, is not a Device object, but rather a filesystem path on a device 
> that can include any number of path components, some of which may be 
> filesystem symlinks that need to be resolved. Normally when a path is opened, 
> the system object manager reparses all DOS 'devices' to the path of an actual 
> Device object, or a path on a Device object, before the I/O manager's parse 
> routine ever sees the path. Such drives need to be resolved whenever parsing 
> starts or restarts at a drive, but the result can be cached in case multiple 
> filesystem symlinks target the same drive.
>       
>       * Substitute drives can target paths on other substitute drives, so 
> `QueryDosDeviceW()` has to be called in a loop that accumulates the tail path 
> components until it reaches a real device (i.e. a target path that doesn't 
> begin with "\??\").
>       * `WNetGetUniversalNameW()` has to be called after resolving substitute 
> drives. It resolves the underlying UNC path of a mapped drive. The target 
> path of the object symlink that implements a mapped drive is of the form 
> "\Device\<redirector device 
> name>\;<something>\server\share\some\filesystem\path". The "redirector device 
> name" component is usually (post Windows Vista) an object symlink to a path 
> on the system's Multiple UNC Provider (MUP) device, "\Device\Mup". The 
> mapped-drive target path ultimately resolves to a redirected filesystem 
> that's mounted in the MUP device namespace at the "share" name. This is an 
> implementation detail of the filesystem redirector and MUP device, which the 
> Multiple Provider Router (MPR) WNet API encapsulates. For example, for the 
> mapped drive path "Z:\spam\eggs", it returns a UNC path of the form 
> "\\server\share\some\filesystem\path\spam\eggs".
> 
>     * A join that tries to resolve ".." against the drive or share root path 
> must fail, whereas this is ignored for the root path in POSIX. For example, 
> `symlink_join("C:\\", "..\\spam")` must fail, since the system would fail an 
> open that tried to reparse that symlink target.
> 
>     * At the end, the resolved path should be tested to try to remove "\\?\" 
> if the source path didn't have this prefix. Call `GetFullPathNameW()` to 
> check for a reserved name in the final component and 
> `PathCchCanonicalizeEx()` to check for long-path support. (The latter calls 
> the system runtime library function `RtlAreLongPathsEnabled`, but that's an 
> undocumented implementation detail.)
> 
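
The `QueryDosDeviceW()` loop described above for substitute drives could look roughly like the following sketch. It assumes a dict, `devices`, holding the target strings that `QueryDosDeviceW()` would return for each drive, and it assumes every substitute drive targets a drive-letter path. `resolve_subst()` and the sample targets are hypothetical.

```python
import ntpath

def resolve_subst(drive, devices):
    """Follow substitute drives until a real device is reached.

    Returns (real_drive, tail), where `tail` accumulates the path
    components contributed by each substitute drive along the way.
    """
    tail = ""
    seen = set()
    while True:
        if drive in seen:
            raise OSError(f"substitute drive loop at {drive!r}")
        seen.add(drive)
        target = devices[drive]
        if not target.startswith("\\??\\"):
            # The target is a real device, e.g. \Device\HarddiskVolume2.
            return drive, tail
        next_drive, path = ntpath.splitdrive(target[4:])
        if not next_drive:
            raise ValueError(f"unsupported target {target!r}")
        # Prepend this drive's path so earlier tails stay at the end.
        tail = path.rstrip("\\") + tail
        drive = next_drive

# Simulated object-namespace targets: S: -> T:\work, T: -> C:\data\projects.
devices = {
    "S:": r"\??\T:\work",
    "T:": r"\??\C:\data\projects",
    "C:": r"\Device\HarddiskVolume2",
}
print(resolve_subst("S:", devices))  # ('C:', '\\data\\projects\\work')
```

The result for each drive can be cached, as the write-up suggests, since several symlinks may target the same drive.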
> 
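
The write-up names `symlink_join()` without giving an implementation; the following is one possible sketch. It assumes `base` is an already-resolved directory and only handles relative targets. A full implementation would also need to check each appended component for further symlinks.

```python
import ntpath

def symlink_join(base, target):
    """Join a relative symlink target onto `base`, resolving ".." one
    component at a time.

    Raises OSError if ".." would climb above the drive or share root,
    instead of silently staying at the root as POSIX does.
    """
    drive, rest = ntpath.splitdrive(base)
    parts = [p for p in rest.split("\\") if p]
    for part in target.replace("/", "\\").split("\\"):
        if part in ("", "."):
            continue
        if part == "..":
            if not parts:
                raise OSError(f"{target!r} climbs above the root of {drive!r}")
            parts.pop()
        else:
            parts.append(part)
    return drive + "\\" + "\\".join(parts)

print(symlink_join("C:\\foo\\bar", "..\\spam"))  # C:\foo\spam
```

With this sketch, `symlink_join("C:\\", "..\\spam")` raises `OSError`, and the same applies to a UNC base such as `\\server\share`, because `ntpath.splitdrive()` treats the share as the root.
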
> `GetFinalPathNameByHandleW()` is not required. Optionally, it can be called 
> for the last valid component if the caller wants a final path with all 
> mountpoints resolved, i.e. add a `final_path=False` option. Of course, a 
> final UNC path must retain mountpoints, so there's nothing we can do in that 
> case. It's fine that this `realpath()` implementation would return a path 
> that contains mountpoints in Windows (as the current implementation also does 
> for UNC paths). They are not symlinks, and this matches the behavior of POSIX.
> 
> I'd include a warning in the documentation that getting a final path via 
> `GetFinalPathNameByHandleW()` in the non-strict case may be dysfunctional. 
> The unresolved tail end of the path may become valid again if a server or 
> device comes back online. If the unresolved part contains symlinks with 
> relative targets such as "\spam" and "..\..\spam", and the `realpath()` call 
> resolved away mountpoints, the remaining path may not resolve correctly 
> against the final path, as compared to how it would resolve against the 
> original path. It definitely will not resolve the same for a rooted target 
> path such as "\spam" if the last resolved reparse point in the original path 
> was a mountpoint, since it will reparse to the root path of the mountpoint 
> device instead of the original opened device, or instead of the last resolved 
> device of a symlink in the path.

----------
components: Library (Lib)
messages: 391804
nosy: barneygale
priority: normal
severity: normal
status: open
title: os.path.realpath() normalizes paths before resolving links on Windows
versions: Python 3.10, Python 3.11, Python 3.6, Python 3.7, Python 3.8, Python 
3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue43936>
_______________________________________