While reading more docs about paths today, I found that UNC paths apparently cannot be relative [1]. This means that `\\?\.\something` and `\\?\C:something` are not valid UNC paths.

After reading up on that, I ran some experiments:

  Path             MSYS2 bash resolves to       CMD resolves to (working drive is C:)
  ----             ----------------------       -------------------------------------
  C:               file not found               working directory on C:
  C:Users          file not found               `Users` in working directory on C:
  C:\              C:\                          C:\
  \                \ of working drive           C:\
  \\?              no such file or directory    incorrect syntax
  \\?\             invalid argument             incorrect syntax
  \\?\C            invalid argument             path not found
  \\?\C:           C:\                          not a recognized device
  \\?\C:\          C:\                          C:\
  \\?\C:Users      invalid argument             path not found
  \\?\C:\Users     C:\Users                     C:\Users


Based on these results, I have updated the implementation quite a bit:

1. For UNC paths, the host name specification no longer contains the terminating
   separator. This makes all UNC paths absolute.
2. DOS drive letters are allowed at the beginning of a non-UNC path, or when they
   immediately follow `\\?\` or `\\.\`.
3. Paths relative to working directories of drives are only allowed in their non-UNC
   form. This means that `dirname("C:")` now returns "C:.", while `dirname("\\?\C:")`
   returns "\\?\C:\".
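
To make the three rules above easier to check, here is a minimal sketch of how the length of the prefix (UNC host name plus optional drive letter) could be measured. It is not the code from the patch: `prefix_length` and `IS_DRIVE` are hypothetical names used only for this illustration, and the sketch is ASCII-only, so it skips the DBCS trail-byte handling that the real implementation needs.

```c
#include <stddef.h>

#define IS_DIR_SEP(c)  ((c) == '/' || (c) == '\\')
#define IS_DRIVE(c)    (((c) >= 'A' && (c) <= 'Z') || ((c) >= 'a' && (c) <= 'z'))

/* Hypothetical helper: returns the length of the UNC prefix and drive
 * letter at the start of `path`, or 0 if there is none. ASCII only.  */
size_t
prefix_length(const char* path)
  {
    size_t pos = 0;
    size_t drv;

    if(IS_DIR_SEP(path[0]) && IS_DIR_SEP(path[1])) {
      /* Rule 1: the host name ends before its terminating separator.  */
      pos = 2;
      while(path[pos] != 0 && !IS_DIR_SEP(path[pos]))
        pos ++;

      /* Only a host name exists.  */
      if(path[pos] == 0)
        return pos;

      /* Rule 2: a drive letter may follow only `\\?\` or `\\.\`.  */
      if(pos != 3 || (path[2] != '.' && path[2] != '?'))
        return pos;
    }

    /* In the UNC case, skip the separator after `\\?` or `\\.`.  */
    drv = pos + (pos != 0);

    /* Rules 2 and 3: the drive letter and colon belong to the prefix;
     * whatever follows decides whether the path is relative.  */
    if(IS_DRIVE(path[drv]) && path[drv + 1] == ':')
      return drv + 2;

    return pos;
  }
```

Under these assumptions, `\\?\C:\Users` and `\\host\usr` both have a prefix of 6 bytes, `C:usr` has a prefix of 2 bytes, and `usr` has none.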


[1] https://learn.microsoft.com/en-us/dotnet/standard/io/file-path-formats#identify-the-path


--
Best regards,
LIU Hao

From 6b207987ea190d6a5e7376bef8b031dcc5790f22 Mon Sep 17 00:00:00 2001
From: LIU Hao <[email protected]>
Date: Sun, 26 Mar 2023 02:02:52 +0800
Subject: [PATCH] crt: Reimplement `dirname()` and `basename()`

It is necessary to re-implement these two functions because

1. They used to change the global locale and were subject to races
   with almost all stdio functions.
2. The previous `basename()` had a VLA and might cause a stack overflow if
   the argument path was too long.
3. They used to produce erroneous results if the argument path was not in
   the default ANSI code page.  (I don't think this is a bug though, just
   a design flaw.)

According to Microsoft documentation about `fopen()` [1], paths are
interpreted with `CP_ACP` if `AreFileApisANSI()` returns true, and `CP_OEMCP`
otherwise. We had better follow that convention. UNC-ized DOS paths should
also be handled, but they cannot be relative [2].

[1] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=msvc-170
[2] https://learn.microsoft.com/en-us/dotnet/standard/io/file-path-formats#identify-the-path

Signed-off-by: LIU Hao <[email protected]>
---
 mingw-w64-crt/misc/dirname.c | 276 +++++++++++++++++++++++++++++++++++
 1 file changed, 276 insertions(+)

diff --git a/mingw-w64-crt/misc/dirname.c b/mingw-w64-crt/misc/dirname.c
index e69de29bb..e8c4c48af 100644
--- a/mingw-w64-crt/misc/dirname.c
+++ b/mingw-w64-crt/misc/dirname.c
@@ -0,0 +1,276 @@
+/**
+ * This file has no copyright assigned and is placed in the Public Domain.
+ * This file is part of the mingw-w64 runtime package.
+ * No warranty is given; refer to the file DISCLAIMER.PD within this package.
+ */
+#ifndef WIN32_LEAN_AND_MEAN
+#define WIN32_LEAN_AND_MEAN
+#endif
+#include <stdlib.h>
+#include <libgen.h>
+#include <windows.h>
+
+/* A 'directory separator' is a byte that equals 0x2F ('solidus' or more
+ * commonly 'forward slash') or 0x5C ('reverse solidus' or more commonly
+ * 'backward slash'). The byte 0x5C may look different from a backward slash
+ * in some locales; for example, it looks the same as a Yen sign in Japanese
+ * locales and a Won sign in Korean locales. Despite its appearance, it still
+ * functions as a directory separator.
+ *
+ * A 'path' comprises an optional DOS drive letter with a colon, and then an
+ * arbitrary number of possibly empty components, separated by non-empty
+ * sequences of directory separators (in other words, consecutive directory
+ * separators are treated as a single one). A path that comprises an empty
+ * component denotes the current working directory.
+ *
+ * An 'absolute path' comprises at least two components, the first of which
+ * is empty.
+ *
+ * A 'relative path' is a path that is not an absolute path. In other words,
+ * it either comprises an empty component, or begins with a non-empty
+ * component.
+ *
+ * POSIX has no concept of DOS drives. A path that does not have a
+ * drive letter starts from the same drive as the current working directory.
+ *
+ * For example:
+ * (Examples without drive letters match POSIX.)
+ *
+ *   Argument                 dirname() returns        basename() returns
+ *   --------                 -----------------        ------------------
+ *   `` or NULL               `.`                      `.`
+ *   `usr`                    `.`                      `usr`
+ *   `usr\`                   `.`                      `usr`
+ *   `\`                      `\`                      `\`
+ *   `\usr`                   `\`                      `usr`
+ *   `\usr\lib`               `\usr`                   `lib`
+ *   `\home\\dwc\\test`       `\home\\dwc`             `test`
+ *   `\\host\usr`             `\\host\`                `usr`
+ *   `\\host\usr\lib`         `\\host\usr`             `lib`
+ *   `\\host\\usr`            `\\host\`                `usr`
+ *   `\\host\\usr\lib`        `\\host\\usr`            `lib`
+ *   `C:`                     `C:.`                    `.`
+ *   `C:usr`                  `C:.`                    `usr`
+ *   `C:usr\`                 `C:.`                    `usr`
+ *   `C:\`                    `C:\`                    `\`
+ *   `C:\\`                   `C:\`                    `\`
+ *   `C:\\\`                  `C:\`                    `\`
+ *   `C:\usr`                 `C:\`                    `usr`
+ *   `C:\usr\lib`             `C:\usr`                 `lib`
+ *   `C:\\usr\\lib\\`         `C:\\usr`                `lib`
+ *   `C:\home\\dwc\\test`     `C:\home\\dwc`           `test`
+ */
+
+struct path_info
+  {
+    /* This points to the end of the UNC prefix and drive letter, if any.  */
+    char* prefix_end;
+
+    /* These point to the directory separator in front of the last non-empty
+     * component.  */
+    char* base_sep_begin;
+    char* base_sep_end;
+
+    /* This points to the last directory separator sequence if no other
+     * non-separator characters follow it.  */
+    char* term_sep_begin;
+
+    /* This points to the end of the string.  */
+    char* path_end;
+  };
+
+#define IS_DIR_SEP(c)  ((c) == '/' || (c) == '\\')
+
+static
+void
+do_get_path_info(struct path_info* info, char* path)
+  {
+    char* pos = path;
+    DWORD cp;
+    int dbcs_tb, dir_sep, dos_dev;
+
+    /* Get the code page for paths in the same way as `fopen()`.  */
+    cp = AreFileApisANSI() ? CP_ACP : CP_OEMCP;
+
+    /* Set the structure to 'no data'.  */
+    info->prefix_end = NULL;
+    info->base_sep_begin = NULL;
+    info->base_sep_end = NULL;
+    info->term_sep_begin = NULL;
+
+    /* Check for a UNC prefix.  */
+    if(IS_DIR_SEP(pos[0]) && IS_DIR_SEP(pos[1])) {
+      pos += 2;
+      info->prefix_end = pos;
+
+      /* Seek to the end of the host name.  */
+      dbcs_tb = 0;
+      while(*pos != 0) {
+        dir_sep = 0;
+
+        if(dbcs_tb)
+          dbcs_tb = 0;
+        else if(IsDBCSLeadByteEx(cp, *pos))
+          dbcs_tb = 1;
+        else
+          dir_sep = IS_DIR_SEP(*pos);
+
+        if(dir_sep)
+          break;
+
+        pos ++;
+      }
+
+      if(*pos == 0) {
+        /* Only a host name exists.  */
+        info->prefix_end = pos;
+        info->path_end = pos;
+        return;
+      }
+
+      /* Host name terminates here. The terminating directory separator is
+       * not part of the prefix, and initiates a new absolute path.  */
+      info->prefix_end = pos;
+    }
+
+    /* A DOS drive letter may follow a `\\.\` or `\\?\` prefix in a UNC path,
+     * or initiate a non-UNC path.  */
+    dos_dev = 0;
+
+    if(pos - path == 3 && (path[2] == '.' || path[2] == '?')) {
+      pos ++;
+      dos_dev = 1;
+    }
+    else if(pos == path)
+      dos_dev = 1;
+
+    if(dos_dev && ((pos[0] >= 'A' && pos[0] <= 'Z')
+                   || (pos[0] >= 'a' && pos[0] <= 'z')) && pos[1] == ':') {
+      pos += 2;
+      info->prefix_end = pos;
+    }
+
+    /* The remaining part of the path is almost the same as POSIX.  */
+    dbcs_tb = 0;
+    while(*pos != 0) {
+      dir_sep = 0;
+
+      if(dbcs_tb)
+        dbcs_tb = 0;
+      else if(IsDBCSLeadByteEx(cp, *pos))
+        dbcs_tb = 1;
+      else
+        dir_sep = IS_DIR_SEP(*pos);
+
+      /* If a separator has been encountered and the previous character
+       * was not one, mark this as the beginning of the terminating separator
+       * sequence.  */
+      if(dir_sep && !info->term_sep_begin)
+        info->term_sep_begin = pos;
+
+      /* If a non-separator character has been encountered and a previous
+       * terminating separator sequence exists, start a new component.  */
+      if(!dir_sep && info->term_sep_begin) {
+        info->base_sep_begin = info->term_sep_begin;
+        info->base_sep_end = pos;
+        info->term_sep_begin = NULL;
+      }
+
+      pos ++;
+    }
+
+    /* Store the end of the path for convenience.  */
+    info->path_end = pos;
+  }
+
+char*
+dirname(char* path)
+  {
+    struct path_info info;
+    char* upath;
+    const char* top;
+    static char* static_path_copy;
+
+    if(path == NULL || path[0] == 0)
+      return (char*) ".";
+
+    do_get_path_info(&info, path);
+    upath = info.prefix_end ? info.prefix_end : path;
+    top = (IS_DIR_SEP(path[0]) || IS_DIR_SEP(upath[0])) ? "\\" : ".";
+
+    /* If a non-terminating directory separator exists, it terminates the
+     * dirname. Truncate the path there.  */
+    if(info.base_sep_begin) {
+      info.base_sep_begin[0] = 0;
+
+      /* If the unprefixed path has not been truncated to empty, it is now
+       * the dirname, so return it.  */
+      if(upath[0])
+        return path;
+    }
+
+    /* The dirname is empty. In principle we return `<prefix>.` if the
+     * path is relative and `<prefix>\` if it is absolute. This can be
+     * optimized if there is no prefix.  */
+    if(upath == path)
+      return (char*) top;
+
+    /* When there is a prefix, we must append a character to the prefix.
+     * If there is enough room in the original path, we just reuse its
+     * storage.  */
+    if(upath != info.path_end) {
+      upath[0] = *top;
+      upath[1] = 0;
+      return path;
+    }
+
+    /* This is only the last resort. If there is no room, we have to copy
+     * the prefix elsewhere.  */
+    upath = realloc(static_path_copy, info.prefix_end - path + 2);
+    if(!upath)
+      return (char*) top;
+
+    static_path_copy = upath;
+    memcpy(upath, path, info.prefix_end - path);
+    upath += info.prefix_end - path;
+    upath[0] = *top;
+    upath[1] = 0;
+    return static_path_copy;
+  }
+
+char*
+basename(char* path)
+  {
+    struct path_info info;
+    char* upath;
+
+    if(path == NULL)
+      return (char*) ".";
+
+    do_get_path_info(&info, path);
+    upath = info.prefix_end ? info.prefix_end : path;
+
+    /* If the unprefixed path is empty, POSIX says '.' shall be returned.  */
+    if(upath[0] == 0)
+      return (char*) ".";
+
+    /* If a terminating separator sequence exists, it is not part of the
+     * name and shall be truncated.  */
+    if(info.term_sep_begin)
+      info.term_sep_begin[0] = 0;
+
+    /* If some other separator sequence has been found, the basename
+     * immediately follows it.  */
+    if(info.base_sep_end)
+      return info.base_sep_end;
+
+    /* If removal of the terminating separator sequence has caused the
+     * unprefixed path to become empty, it must have comprised only
+     * separators. POSIX says `/` shall be returned, but on Windows, we
+     * return `\` instead.  */
+    if(upath[0] == 0)
+      return (char*) "\\";
+
+    /* Return the unprefixed path.  */
+    return upath;
+  }
-- 
2.40.0
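
For anyone who wants to experiment with the separator scan outside of the CRT, the loop in `do_get_path_info()` can be approximated by the sketch below. This is a simplified stand-in and not part of the patch: `struct scan_result` and `scan_path` are hypothetical names, offsets are used instead of pointers, the input is assumed to be the unprefixed part of the path, and DBCS lead bytes are ignored.

```c
#include <stddef.h>

#define IS_DIR_SEP(c)  ((c) == '/' || (c) == '\\')

/* Hypothetical result type for this sketch; offsets instead of pointers.  */
struct scan_result
  {
    int has_base_sep;       /* a separator precedes the last component */
    size_t base_sep_begin;  /* start of that separator sequence */
    size_t base_sep_end;    /* start of the last non-empty component */
    int has_term_sep;       /* the path ends with separators */
    size_t term_sep_begin;  /* start of the trailing separator sequence */
    size_t path_end;        /* length of the path */
  };

/* ASCII-only approximation of the scan loop; DBCS handling is left out.  */
void
scan_path(struct scan_result* res, const char* path)
  {
    size_t pos = 0;
    size_t sep_begin = 0;
    int in_sep = 0;

    res->has_base_sep = 0;
    res->base_sep_begin = 0;
    res->base_sep_end = 0;

    for(; path[pos] != 0; pos ++) {
      if(IS_DIR_SEP(path[pos])) {
        /* Remember where the current separator sequence starts.  */
        if(!in_sep) {
          in_sep = 1;
          sep_begin = pos;
        }
      }
      else if(in_sep) {
        /* A component follows the sequence, so it was not terminating.  */
        res->has_base_sep = 1;
        res->base_sep_begin = sep_begin;
        res->base_sep_end = pos;
        in_sep = 0;
      }
    }

    /* A sequence that is still open at the end is the terminating one.  */
    res->has_term_sep = in_sep;
    res->term_sep_begin = in_sep ? sep_begin : 0;
    res->path_end = pos;
  }
```

In terms of this sketch, the dirname is the part of the path before `base_sep_begin`, and the basename runs from `base_sep_end` up to `term_sep_begin` (or `path_end` if there is no terminating separator sequence).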

_______________________________________________
Mingw-w64-public mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public