On 2023-03-26 23:49, Alvin Wong wrote:
Hi,

Overall I think the implementation looks fine. The test cases are missing some of the more exotic DOS device paths[1], for example `\\.\Volume{b75e2c83-0000-0000-0000-602f00000000}\` but to be frank I don't know what should be the expected outputs for it. Should we simply declare that DOS device paths may not work as expected?

That is interpreted as a network path to the directory `Volume{b75e2c83-0000-0000-0000-602f00000000}` on the default directory on host `.`.

I am not sure how to deal with that, but it does not look like a special case. For example, on Linux we can have a directory behind a mount point such as `/mnt/cdrom0/foo/bar`, and it is no different from a conventional path in any way; calling `dirname()` repeatedly will strip components, including the mount point, until only `/` remains.
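
To illustrate the Linux case, here is a minimal sketch that assumes a POSIX system with `<libgen.h>` and forward-slash paths. The helper name `count_dirname_steps` is hypothetical and not part of the patch. It applies `dirname()` repeatedly until the result stops changing and counts the steps.

```c
#include <assert.h>
#include <libgen.h>
#include <string.h>

/* Hypothetical helper: counts how many dirname() calls change the path
 * before a fixed point (such as `/` or `.`) is reached. Returns -1 if
 * the path does not fit into the local buffer.  */
int
count_dirname_steps(const char* path)
  {
    char buf[256], prev[256];
    int steps = 0;

    assert(path != NULL);
    if(strlen(path) >= sizeof(buf))
      return -1;

    strcpy(buf, path);
    for(;;) {
      char* dir;

      /* dirname() may modify its argument or return static storage, so
       * remember the old value and copy the result back each time.  */
      strcpy(prev, buf);
      dir = dirname(buf);
      memmove(buf, dir, strlen(dir) + 1);

      if(strcmp(buf, prev) == 0)
        break;

      steps ++;
    }
    return steps;
  }
```

With `/mnt/cdrom0/foo/bar` the path goes through `/mnt/cdrom0/foo`, `/mnt/cdrom0`, `/mnt` and `/`, so the mount point is stripped like any other component.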


Manual fuzzing with asan and ubsan revealed an issue: The line `info.prefix_end[1] = 0;` (line 213) can result in a buffer overrun. This happens with inputs in the form of an UNC host only, e.g. `\\host`, which is also missing from the test cases.

Oh, thanks. That field was added too late and I forgot about it. Attached is a patch that fixes it.
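
The condition that matters can be sketched in portable C. This is a simplified illustration, not the code from the patch: it ignores DBCS lead bytes and drive letters, and the names `unc_prefix_length` and `can_append_in_place` are hypothetical. It shows that for `\\host` the prefix ends at the null terminator, so nothing may be stored beyond it.

```c
#include <assert.h>
#include <stddef.h>

#define IS_DIR_SEP(c)  ((c) == '/' || (c) == '\\')

/* Hypothetical helper: returns the length of the UNC prefix of `path`,
 * or 0 if there is none. Simplification: DBCS is not handled.  */
size_t
unc_prefix_length(const char* path)
  {
    size_t pos;

    assert(path != NULL);
    if(!IS_DIR_SEP(path[0]) || !IS_DIR_SEP(path[1]))
      return 0;

    /* Skip the two leading separators, then the host name.  */
    pos = 2;
    while(path[pos] != 0 && !IS_DIR_SEP(path[pos]))
      pos ++;

    /* The separator that terminates the host name, if there is one,
     * belongs to the prefix.  */
    if(path[pos] != 0)
      pos ++;

    return pos;
  }

/* Hypothetical helper: returns non-zero if a character and a null
 * terminator can be stored at the end of the prefix without writing
 * past the end of the original string.  */
int
can_append_in_place(const char* path)
  {
    return path[unc_prefix_length(path)] != 0;
  }
```

For `\\host` and `\\host\` the check fails and the result has to be built in separate storage; for `\\host\usr` there is room inside the original buffer.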

A UNC path with only a host name is not a valid path, however. I think it might make sense to return whatever the current algorithm produces, unless others disagree. At the moment, a path may consist of only a prefix

   C:
     ^-------- prefix ends here

and such a path is interpreted as the current working directory on drive C:. A path with only a host name has the same structure

   \\hostname\
              ^-------- prefix ends here

and `dirname()` returns `\\hostname\.`. But if the host name is given without a terminating separator, a dot is still appended to that (invalid) path, giving `\\hostname.`. Should this be converted to `\\hostname\.`?
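
If the answer were yes, the conversion could look roughly like the sketch below. This is only an illustration of the idea and not what the attached patch does. The function name `make_prefix_only_dirname` is hypothetical, the caller is assumed to have already established that the string is nothing but a UNC prefix, and DBCS trail bytes that equal 0x5C are not considered.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: given a string that consists of only a UNC
 * prefix, returns a newly allocated `<prefix>\.`, inserting the
 * separator if the prefix does not already end with one. Returns NULL
 * if memory cannot be allocated.  */
char*
make_prefix_only_dirname(const char* prefix)
  {
    size_t len;
    int need_sep;
    char* out;

    assert(prefix != NULL);
    len = strlen(prefix);
    need_sep = len > 0 && prefix[len - 1] != '/' && prefix[len - 1] != '\\';

    /* Room for the prefix, an optional separator, a dot, and the null
     * terminator.  */
    out = malloc(len + 3);
    if(!out)
      return NULL;

    memcpy(out, prefix, len);
    if(need_sep)
      out[len ++] = '\\';

    out[len ++] = '.';
    out[len] = 0;
    return out;
  }
```

Both `\\hostname` and `\\hostname\` would then yield `\\hostname\.`. The caller owns the returned string and has to `free()` it.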



--
Best regards,
LIU Hao

From ac4a2fc5c6c4ae36621695a115bbe66204ce0d6e Mon Sep 17 00:00:00 2001
From: LIU Hao <[email protected]>
Date: Sun, 26 Mar 2023 02:02:52 +0800
Subject: [PATCH] crt: Reimplement `dirname()` and `basename()`

Signed-off-by: LIU Hao <[email protected]>
---
 mingw-w64-crt/misc/dirname.c | 267 +++++++++++++++++++++++++++++++++++
 1 file changed, 267 insertions(+)

diff --git a/mingw-w64-crt/misc/dirname.c b/mingw-w64-crt/misc/dirname.c
index e69de29bb..87a46d4e7 100644
--- a/mingw-w64-crt/misc/dirname.c
+++ b/mingw-w64-crt/misc/dirname.c
@@ -0,0 +1,267 @@
+/**
+ * This file has no copyright assigned and is placed in the Public Domain.
+ * This file is part of the mingw-w64 runtime package.
+ * No warranty is given; refer to the file DISCLAIMER.PD within this package.
+ */
+#ifndef WIN32_LEAN_AND_MEAN
+#define WIN32_LEAN_AND_MEAN
+#endif
+#include <stdlib.h>
+#include <libgen.h>
+#include <windows.h>
+
+/* A 'directory separator' is a byte that equals 0x2F ('solidus' or more
+ * commonly 'forward slash') or 0x5C ('reverse solidus' or more commonly
+ * 'backward slash'). The byte 0x5C may look different from a backward slash
+ * in some locales; for example, it looks the same as a Yen sign in Japanese
+ * locales and a Won sign in Korean locales. Despite its appearance, it still
+ * functions as a directory separator.
+ *
+ * A 'path' comprises an optional DOS drive letter with a colon, and then an
+ * arbitrary number of possibly empty components, separated by non-empty
+ * sequences of directory separators (in other words, consecutive directory
+ * separators are treated as a single one). A path that comprises an empty
+ * component denotes the current working directory.
+ *
+ * An 'absolute path' comprises at least two components, the first of which
+ * is empty.
+ *
+ * A 'relative path' is a path that is not an absolute path. In other words,
+ * it either comprises an empty component, or begins with a non-empty
+ * component.
+ *
+ * POSIX doesn't have a concept about DOS drives. A path that does not have a
+ * drive letter starts from the same drive as the current working directory.
+ *
+ * For example:
+ * (Examples without drive letters match POSIX.)
+ *
+ *   Argument                 dirname() returns        basename() returns
+ *   --------                 -----------------        ------------------
+ *   `` or NULL               `.`                      `.`
+ *   `usr`                    `.`                      `usr`
+ *   `usr\`                   `.`                      `usr`
+ *   `\`                      `\`                      `\`
+ *   `\usr`                   `\`                      `usr`
+ *   `\usr\lib`               `\usr`                   `lib`
+ *   `\home\\dwc\\test`       `\home\\dwc`             `test`
+ *   `\\host\usr`             `\\host\.`               `usr`
+ *   `\\host\usr\lib`         `\\host\usr`             `lib`
+ *   `\\host\\usr`            `\\host\\`               `usr`
+ *   `\\host\\usr\lib`        `\\host\\usr`            `lib`
+ *   `C:`                     `C:.`                    `.`
+ *   `C:usr`                  `C:.`                    `usr`
+ *   `C:usr\`                 `C:.`                    `usr`
+ *   `C:\`                    `C:\`                    `\`
+ *   `C:\\`                   `C:\`                    `\`
+ *   `C:\\\`                  `C:\`                    `\`
+ *   `C:\usr`                 `C:\`                    `usr`
+ *   `C:\usr\lib`             `C:\usr`                 `lib`
+ *   `C:\\usr\\lib\\`         `C:\\usr`                `lib`
+ *   `C:\home\\dwc\\test`     `C:\home\\dwc`           `test`
+ */
+
+struct path_info
+  {
+    /* This points to the end of the UNC prefix and drive letter, if any.  */
+    char* prefix_end;
+
+    /* These point to the directory separator in front of the last non-empty
+     * component.  */
+    char* base_sep_begin;
+    char* base_sep_end;
+
+    /* This points to the last directory separator sequence if no other
+     * non-separator characters follow it.  */
+    char* term_sep_begin;
+
+    /* This points to the end of the string.  */
+    char* path_end;
+  };
+
+#define IS_DIR_SEP(c)  ((c) == '/' || (c) == '\\')
+
+static
+void
+do_get_path_info(struct path_info* info, char* path)
+  {
+    char* pos = path;
+    DWORD cp;
+    int dbcs_tb, dir_sep;
+
+    /* Get the code page for paths in the same way as `fopen()`.  */
+    cp = AreFileApisANSI() ? CP_ACP : CP_OEMCP;
+
+    /* Set the structure to 'no data'.  */
+    info->prefix_end = NULL;
+    info->base_sep_begin = NULL;
+    info->base_sep_end = NULL;
+    info->term_sep_begin = NULL;
+
+    /* Check for a UNC prefix.  */
+    if(IS_DIR_SEP(pos[0]) && IS_DIR_SEP(pos[1])) {
+      pos += 2;
+      info->prefix_end = pos;
+
+      /* Seek to the end of the host name.  */
+      dbcs_tb = 0;
+      while(*pos != 0) {
+        dir_sep = 0;
+
+        if(dbcs_tb)
+          dbcs_tb = 0;
+        else if(IsDBCSLeadByteEx(cp, *pos))
+          dbcs_tb = 1;
+        else
+          dir_sep = IS_DIR_SEP(*pos);
+
+        if(dir_sep)
+          break;
+
+        pos ++;
+      }
+
+      if(*pos == 0) {
+        /* Only a host name exists.  */
+        info->prefix_end = pos;
+        info->path_end = pos;
+        return;
+      }
+
+      /* Host name terminates here. The terminating directory separator is
+       * part of the prefix.  */
+      pos ++;
+      info->prefix_end = pos;
+    }
+
+    /* Check for a DOS drive letter.  */
+    if((pos[0] >= 'A' && pos[0] <= 'Z' && pos[1] == ':')
+       || (pos[0] >= 'a' && pos[0] <= 'z' && pos[1] == ':')) {
+      pos += 2;
+      info->prefix_end = pos;
+    }
+
+    /* The remaining part of the path is almost the same as POSIX.  */
+    dbcs_tb = 0;
+    while(*pos != 0) {
+      dir_sep = 0;
+
+      if(dbcs_tb)
+        dbcs_tb = 0;
+      else if(IsDBCSLeadByteEx(cp, *pos))
+        dbcs_tb = 1;
+      else
+        dir_sep = IS_DIR_SEP(*pos);
+
+      /* If a separator has been encountered and the previous character
+       * was not, mark this as the beginning of the terminating separator
+       * sequence.  */
+      if(dir_sep && !info->term_sep_begin)
+        info->term_sep_begin = pos;
+
+      /* If a non-separator character has been encountered and a previous
+       * terminating separator sequence exists, start a new component.  */
+      if(!dir_sep && info->term_sep_begin) {
+        info->base_sep_begin = info->term_sep_begin;
+        info->base_sep_end = pos;
+        info->term_sep_begin = NULL;
+      }
+
+      pos ++;
+    }
+
+    /* Store the end of the path for convenience.  */
+    info->path_end = pos;
+  }
+
+char*
+dirname(char* path)
+  {
+    struct path_info info;
+    char* upath;
+    const char* top;
+    static char* static_path_copy;
+
+    if(path == NULL || path[0] == 0)
+      return (char*) ".";
+
+    do_get_path_info(&info, path);
+    upath = info.prefix_end ? info.prefix_end : path;
+    top = IS_DIR_SEP(upath[0]) ? "\\" : ".";
+
+    /* If a non-terminating directory separator exists, it terminates the
+     * dirname. Truncate the path there.  */
+    if(info.base_sep_begin) {
+      info.base_sep_begin[0] = 0;
+
+      /* If the unprefixed path has not been truncated to empty, it is now
+       * the dirname, so return it.  */
+      if(upath[0])
+        return path;
+    }
+
+    /* The dirname is empty. In principle we return `<prefix>.` if the
+     * path is relative and `<prefix>\` if it is absolute. This can be
+     * optimized if there is no prefix.  */
+    if(upath == path)
+      return (char*) top;
+
+    /* When there is a prefix, we must append a character to the prefix.
+     * If there is enough room in the original path, we just reuse its
+     * storage.  */
+    if(upath != info.path_end) {
+      upath[0] = *top;
+      upath[1] = 0;
+      return path;
+    }
+
+    /* This is only the last resort. If there is no room, we have to copy
+     * the prefix elsewhere.  */
+    upath = realloc(static_path_copy, info.prefix_end - path + 2);
+    if(!upath)
+      return (char*) top;
+
+    static_path_copy = upath;
+    memcpy(upath, path, info.prefix_end - path);
+    upath += info.prefix_end - path;
+    upath[0] = *top;
+    upath[1] = 0;
+    return static_path_copy;
+  }
+
+char*
+basename(char* path)
+  {
+    struct path_info info;
+    char* upath;
+
+    if(path == NULL)
+      return (char*) ".";
+
+    do_get_path_info(&info, path);
+    upath = info.prefix_end ? info.prefix_end : path;
+
+    /* If the unprefixed path is empty, POSIX says '.' shall be returned.  */
+    if(upath[0] == 0)
+      return (char*) ".";
+
+    /* If a terminating separator sequence exists, it is not part of the
+     * name and shall be truncated.  */
+    if(info.term_sep_begin)
+      info.term_sep_begin[0] = 0;
+
+    /* If some other separator sequence has been found, the basename
+     * immediately follows it.  */
+    if(info.base_sep_end)
+      return info.base_sep_end;
+
+    /* If removal of the terminating separator sequence has caused the
+     * unprefixed path to become empty, it must have comprised only
+     * separators. POSIX says `/` shall be returned, but on Windows, we
+     * return `\` instead.  */
+    if(upath[0] == 0)
+      return (char*) "\\";
+
+    /* Return the unprefixed path.  */
+    return upath;
+  }
-- 
2.40.0
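
The rows of the table in the file comment that have no drive letter and no UNC prefix are stated to match POSIX. A minimal sketch for cross-checking them against a POSIX libc, with forward slashes in place of backslashes, might look like this. It assumes a POSIX system with `<libgen.h>`, and the helper name `check_posix_row` is hypothetical.

```c
#include <assert.h>
#include <libgen.h>
#include <string.h>

/* Hypothetical helper: returns non-zero if dirname() and basename()
 * of `arg` equal the expected strings. Both functions may modify
 * their argument, so each one gets its own copy.  */
int
check_posix_row(const char* arg, const char* want_dir, const char* want_base)
  {
    char dir_buf[64], base_buf[64];

    assert(arg != NULL && strlen(arg) < sizeof(dir_buf));
    strcpy(dir_buf, arg);
    strcpy(base_buf, arg);

    return strcmp(dirname(dir_buf), want_dir) == 0
           && strcmp(basename(base_buf), want_base) == 0;
  }
```

For example, `check_posix_row("/usr/lib", "/usr", "lib")` corresponds to the row for `\usr\lib`.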

_______________________________________________
Mingw-w64-public mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public
