> -----Original Message-----
> From: Fleshgrinder [mailto:p...@fleshgrinder.com]
> Sent: Saturday, April 1, 2017 2:43 PM
> To: Anatol Belski <weltl...@outlook.de>; Rasmus Schultz
> <ras...@mindplay.dk>
> Cc: PHP internals <internals@lists.php.net>
> Subject: Re: [PHP-DEV] Directory separators on Windows
> 
> On 4/1/2017 2:01 PM, Anatol Belski wrote:
> > 1. optionally - yes, otherwise it should do platform default
> > 2. no, this kind of operation is a pure parsing, no I/O related checks
> > needed
> > 3. irrelevant, but can be defined
> >
> > Other points yet I'd care about
> > - result should be correct for target platform disregarding actual
> > platform, fe target Linux path on Windows, or Windows path on Mac, etc.
> > - validation, particularly for reserved words and chars, also other
> > platform aspects
> > - encodings have to be respected, or UTF-8 only, to define
> > - probably should be compatible with PHP stream wrapper namespaces
> >
> >
> > Thanks
> >
> > Anatol
> >
> 
> 1. How do you envision that? If the path is `/a/b/../c` where only `/a`
> exists right now? It's unresolvable, assuming that `../` points to `/a` is
> wrong if `b/` is a symbolic link that points to `/x/y`.
> 
> 2. Here I agree, casing cannot be decided without hitting the filesystem. Some
> are case-sensitive, some insensitive, and others configurable.
> 
Basically, it is the same as your points 8., 9. and 10. - it deals with the 
given path itself, so no symlinks, etc. The snippet /a/b/../c is parsed as 
follows:

- parse up to /a/b/../
- scroll back to /a
- append the remainder, so it becomes /a/c

The process is similar for /a/./b, which would become /a/b, and for other 
cases. It is string traversal only. What dirname() does uses this approach. In 
general one can say that normalization is path simplification, with no drive 
access like realpath() does. For example, it lets you know whether the path 
itself would be correct before it comes to the actual file operation, and 
avoid the I/O otherwise.
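
A minimal sketch of that string-only traversal, assuming forward slashes as 
the separator. The name normalize_segments() is made up for illustration and 
is not an existing PHP function:

```php
// Hypothetical helper: purely lexical resolution of "." and ".." segments.
// No filesystem access, so symlinks are never followed (unlike realpath()).
function normalize_segments(string $path): string
{
    $absolute = $path !== '' && $path[0] === '/';
    $out = [];
    foreach (explode('/', $path) as $segment) {
        if ($segment === '' || $segment === '.') {
            continue; // empty segments (from "//") and self-references are dropped
        }
        if ($segment === '..') {
            if ($out !== [] && end($out) !== '..') {
                array_pop($out); // "scroll back" one segment
            } elseif (!$absolute) {
                $out[] = '..'; // relative path: keep a leading ".."
            }
            // absolute path: ".." at the root is dropped
            continue;
        }
        $out[] = $segment;
    }
    return ($absolute ? '/' : '') . implode('/', $out);
}

echo normalize_segments('/a/b/../c'), "\n"; // /a/c
echo normalize_segments('/a/./b'), "\n";    // /a/b
```

Whether ".." above the root should be dropped or rejected is a design choice 
the sketch makes arbitrarily.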

> 3. Does not matter for Windows itself, it is case-insensitive.
> 
> (I continue the numbering for the points you raised.)
> 
> 4. How would we go about normalizing a Windows path to POSIX? `C:\a` is not
> necessarily the same as `/a`, or should it produce `C:/a`?
>
As mentioned in an earlier post, it might make sense to have flags to control 
the behavior. Maybe a signature like

string canonicalize_path(string $path, int $flags = 0);

The function of course knows the current platform. Flags like 
PATH_TARGET_WINDOWS | PATH_UNIXIFY would control the path separator behavior. 
Generally, regarding paths without a drive letter - on Windows I'd strongly 
advise not to use them in configs, etc. because of the multiple root issues 
mentioned already. But in principle, say one has the same FS structure on 
different platforms and just wants to mirror it, that would be ok with flags 
like PATH_TARGET_LINUX | PATH_STRIP_DRIVE, as Linux implies forward slashes. 
Or, for the reverse case - generating a path on Linux that is to be used on 
Windows - the flags might contain only PATH_TARGET_WINDOWS, which would 
produce backslashes as the system default. Maybe that's too much or unrelated, 
and only platform targets should be provided, dunno, just a mind game for now.
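
To make the mind game a bit more concrete, here is a rough sketch of how such 
flags could steer the separator handling. None of these constants or the 
convert_separators() helper exist in PHP; the bit values are arbitrary, and 
treating a backslash as a separator for a Linux target is an assumption (on 
Linux it is a valid file name character):

```php
// All names below are hypothetical; they mirror the flags floated in this mail.
const PATH_TARGET_WINDOWS = 1 << 0;
const PATH_TARGET_LINUX   = 1 << 1;
const PATH_UNIXIFY        = 1 << 2; // force forward slashes
const PATH_STRIP_DRIVE    = 1 << 3; // drop a leading "C:" style drive

function convert_separators(string $path, int $flags = 0): string
{
    if ($flags & PATH_STRIP_DRIVE) {
        $path = preg_replace('~^[A-Za-z]:~', '', $path);
    }
    // Without a target flag, fall back to the current platform.
    $windows = DIRECTORY_SEPARATOR === '\\';
    if ($flags & PATH_TARGET_WINDOWS) {
        $windows = true;
    } elseif ($flags & PATH_TARGET_LINUX) {
        $windows = false;
    }
    $separator = ($windows && !($flags & PATH_UNIXIFY)) ? '\\' : '/';
    // Assumption: both "\" and "/" in the input are separators.
    return strtr($path, ['\\' => $separator, '/' => $separator]);
}

echo convert_separators('C:\\a\\b', PATH_TARGET_LINUX | PATH_STRIP_DRIVE), "\n"; // /a/b
echo convert_separators('a/b', PATH_TARGET_WINDOWS), "\n";                       // a\b
```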

> 5. 👍
> 
> 6. I vote for UTF-8 only. We already have locale dependent filesystem
> functions, which also makes them kind of weird to use, especially in
> libraries. Another very important aspect to take care of for this point is
> normalization forms. Filesystems generally store stuff as is, that means
> that we can create two files with the same name, at least by the looks of
> it, which are actually different ones. Think of `ä` which can also be `ä`.
> It is generally most advisable to stick to NFC, because that is also how
> users usually produce those chars.
> 
Yeah, UTF-8 would probably be the simplest for a cross-platform 
implementation. Regarding the encoding variant - that's where more care would 
be needed. For example, see https://github.com/aws/aws-cli/issues/1639 , 
that's where we would care about PATH_TARGET_MAC specific things. It is 
comparable to the situation where you want to escapeshell* something, but the 
result is invalid on another platform or possibly with another shell, given 
how it currently works.
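
A small illustration of why the normalization form matters for path 
comparison. The byte-level part runs on any PHP 7+; the Normalizer call needs 
the intl extension, so it is guarded here:

```php
$nfc = "\u{00E4}";  // precomposed "a with diaeresis" (NFC)
$nfd = "a\u{0308}"; // "a" followed by a combining diaeresis (NFD)

var_dump($nfc === $nfd);              // bool(false): different byte sequences
var_dump(strlen($nfc), strlen($nfd)); // int(2) and int(3)

// Only with ext/intl loaded: both spellings normalize to the same NFC string.
if (class_exists('Normalizer')) {
    var_dump(Normalizer::normalize($nfd, Normalizer::FORM_C) === $nfc); // bool(true)
}
```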

> 7. 👍 just forward I'd say.
> 
> 8. Collapse multiple separators (e.g. `a//b` ~> `a/b`).
> 
> 9. Resolve self-references, unless they are leading (e.g. `a/./b` ~> `a/b` but
> `./a/b` stays `./a/b`).
> 
> 10. Trim separators from the end (e.g. `a/` ~> `a`).
> 
These last 3 points, as well as the one above, are canonicalization. Of 
course, in the imaginary function it could be decoupled, like PATH_NO_CANONIC 
if it's not wanted, or PATH_CANONICALIZE_ONLY to omit other conversions. It's 
only about having sensible behaviors. Possible other flags could be 
PATH_STRIP_TRAILING_SLASH, PATH_ALLOW_RELATIVE and other fine things. But by 
default, the function should do the default thing for the target platform, 
based on the current platform. Thus, producing NFD for Mac and NFC otherwise, 
backslash for Windows and forward slash otherwise, and other things that will 
for sure pop up.

As mentioned earlier, this still requires some re-implementation of the 
platform APIs, even if we'd talk about slashes only - for ASCII paths I'm not 
sure we can even differentiate the UTF-8 encoding forms without involving yet 
another library, so this might be tricky. Simply exposing part of the 
realpath() processing might solve several things for one given platform, 
that's for sure.

The initial case Rasmus reported was about cross-platform handling, but the 
topic is indeed slightly bigger than just path separators, so IMO the 
convenient way would be to care about a cross-platform approach. I have no 
info on how relevant such cross-platform path issues really are, so it might 
be another story to investigate before one starts any implementation. At 
least grouping some cases and thoughts, maybe as an RFC, could be good to 
track the topic.
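
For reference, points 8 to 10 could be sketched as plain string operations 
like this. canonicalize_lexical() is a made-up name, forward slashes are 
assumed, and special cases such as UNC prefixes are deliberately ignored:

```php
// Hypothetical helper covering points 8 to 10 only; no I/O involved.
function canonicalize_lexical(string $path): string
{
    // 8. Collapse runs of separators: "a//b" becomes "a/b".
    $path = preg_replace('~/{2,}~', '/', $path);
    // 9. Drop "./" self-references that follow a separator. A leading "./"
    //    is kept. Repeat, because matches cannot overlap ("a/././b").
    do {
        $path = preg_replace('~/\./~', '/', $path, -1, $count);
    } while ($count > 0);
    // A trailing "/." is a self-reference as well.
    $path = preg_replace('~/\.$~', '/', $path);
    // 10. Trim trailing separators, but keep a lone "/" (the root).
    if ($path !== '/') {
        $path = rtrim($path, '/');
    }
    return $path;
}

echo canonicalize_lexical('a//b'), "\n";  // a/b
echo canonicalize_lexical('a/./b'), "\n"; // a/b
echo canonicalize_lexical('./a/b'), "\n"; // ./a/b
echo canonicalize_lexical('a/'), "\n";    // a
```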

Thanks

Anatol
