On May 9, 2020, at 17:35, Steve Jorgensen <ste...@stevej.name> wrote:
>
> I believe the Python standard library should include a means of sanitizing a
> filesystem entry, and this should not be something requiring a 3rd party
> package.
>
> One of reasons I think this should be in the standard lib is because that
> provides a common, simple means for code reviewers and static analysis
> services such as Veracode to recognize that a value is sanitized in an
> accepted manner.

This does seem like a good idea. People who do this themselves get it wrong all the time, occasionally with disastrous consequences, so if Python can solve that, that would be great.

But, at least historically, this has been more complicated than what you’re suggesting here. For example, don’t you have to catch things like directories named “Con” or files whose 8.3 representation has “CON” as the 8 part? I don’t think you can hang an entire Windows system by abusing those anymore, but you can still produce filenames that some APIs, and some tools (possibly including Explorer, cmd, powershell, Cygwin, mingw/native shells, Python itself…) can’t access (or can only access if the user manually specified a \\.\ absolute path, or whatever).

Is there an established algorithm/rule that lots of people in the industry trust that Python can just reference, instead of having to research or invent it? Because otherwise, we run the risk of making things worse instead of better.

> What I am envisioning is a function (presumably in `os.path` with a signature
> roughly like
> {{{
> sanitizepart(name, permissive=False, mode=ESCAPE, system=None)
> }}}

Maybe it would make more sense to put this in pathlib. Then you construct a PurePath of the appropriate type, and call sanitize() on it (maybe with a flag that ensures that it’s a single path component if you expected it to be one). I think some, but not all, of this logic already exists in pathlib.

> When `permissive` is `False`, characters that are generally unsafe are
> rejected. When `permissive` is `True`, only path separator characters are
> rejected. Generally unsafe characters besides path separators would include
> things like a leading ".", any non-printing character, any wildcard, piping
> and redirection characters, etc.

I think neither of these is what I’d usually want. I never want to sanitize just pathsep characters without sanitizing all illegal characters.
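
As a rough illustration of two of the checks discussed above (the "single path component" flag for a pathlib-based API, and the reserved device names on Windows), here is a minimal sketch. The function names `is_single_component` and `is_windows_reserved` and the reserved-name set are assumptions made up for this example. They are not an existing pathlib API and not a vetted rule, and the sketch does not attempt to handle 8.3 short names at all.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Device names that Windows has historically treated as reserved, even with
# an extension attached ("con.txt"). Assumed list for illustration only.
_WINDOWS_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def is_single_component(name, path_type=PurePosixPath):
    """Return True if `name` parses as exactly one ordinary path component."""
    if name in ("", ".", ".."):
        return False
    path = path_type(name)
    # No drive or root, and parsing neither split nor normalized the name.
    return not path.anchor and path.parts == (name,)


def is_windows_reserved(name):
    """Return True if `name` collides with a reserved Windows device name."""
    # Ignore case, trailing dots/spaces, and everything after the first dot.
    stem = name.rstrip(". ").split(".", 1)[0].rstrip(" ")
    return stem.upper() in _WINDOWS_RESERVED


if __name__ == "__main__":
    for candidate in ["report.txt", "a/b", "a\\b", "C:foo", "Con", "con.txt"]:
        print(
            repr(candidate),
            is_single_component(candidate, PurePosixPath),
            is_single_component(candidate, PureWindowsPath),
            is_windows_reserved(candidate),
        )
```

Note that the same string can be a single component on one platform and not on another ("a\\b" is one name on POSIX but two parts on Windows), which is one reason the check takes the path type as a parameter instead of guessing.
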
I do often want to sanitize all illegal characters (just \0 and the path sep on POSIX, a larger set that I don’t know by heart on Windows). I don’t think I’ve ever wanted to sanitize the set of potentially-unsafe characters you’re proposing here.

I have wanted to sanitize (or pop up an “are you sure?” dialog, etc.) a wider range of potentially confusing characters. For example, newlines or Unicode separators can be very confusing in filenames. I’ve used one of those “potentially misleading URL” libs for this even though files and URLs aren’t quite the same and it was definitely overzealous, but if I’m not really confident that someone has thought through the details and widely vetted them, I’d rather have overzealous than underzealous for something like this.

Meanwhile, on POSIX, it’s actually bytes rather than characters that are illegal. Any character whose encoding in the filesystem’s encoding would contain a \0 or \x2f byte is therefore illegal. Of course in UTF-8, the only such characters are NUL and /, so in scripts I write for my own use on my own systems where I know all the filesystems are UTF-8 I don’t worry about this. But something meant for hardening/verification tools seems like it needs to meet a higher standard and work on more varied systems. And I don’t know how you could even apply the right rule without knowing what the file system encoding is (which means you need the full path, not just the component to be checked) or requiring bytes rather than str (but then it doesn’t work for Windows, and resolving that whole mess gets extra fun, and even on POSIX it’s a lot less common to use).

Speaking of encodings and Windows, isn’t any character not in the user’s OEM code page likely to be confusing? Sure, it’ll work with other Python 3.8 scripts, but it’ll crash or do the wrong thing or display mojibake when used with lots of other tools.

> The `mode` argument indicates what to do with unacceptable characters.
> Escape them (`ESCAPE`), omit them (`OMIT`) or raise an exception (`RAISE`).

What’s the exception, and what attributes does it have? Usually I don’t care too much as long as the traceback/log entry/whatever is good enough for debugging, but for this function, I think I’d often want to be able to programmatically access the character(s) that triggered the error so I can tell the user. Especially if the rule isn’t a fixed, well-known one that you can describe the way Windows Explorer does when you try to use an illegal character.

> This could also double as an escape character argument when a string is
> given. The default escape character should probably be "%" (same as URL
> encoding).

But % only makes sense with a specific encoding of the escaped character, which is a totally different encoding than the one used by other escape mechanisms, so how can just an escape string select between them? If I give it \U expecting to get JSON escapes but instead get % escapes with \U in place of %, that won’t be at all useful. In fact, passing any string at all besides % won’t be at all useful, because I don’t think there’s any other escape mechanism with the same rules as %-encoding but a different escape character.

Not only that, but %-encoding doesn’t make sense with a different list of characters to be encoded than the one actually used by URLs, most obviously because % itself is not an unsafe filename character but you’d better be escaping it anyway, or the system is ridiculously easy to break/exploit.

More importantly, what would %-encoding be good for? No other program (including Finder/Explorer, native GUI apps, native shell tools, etc.) will know how to generate the same name from the same user input, much less how to convert it back to something human-readable, etc.
Even browsers following file: URLs won’t be able to use these names, and in fact it’ll be pretty confusing that (a) it’s misleadingly close to URL escaping but not the same, and (b) you have to %-escape the %-escaped filename to actually get a usable file URL out of it.
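
To make the two %-encoding problems above concrete, here is a small sketch. `naive_escape` and `escape_with_percent` are hypothetical helpers invented for this example (they are not the proposed `sanitizepart`), and the "unsafe" set is just the POSIX-illegal pair. Only `urllib.parse.quote` and `unquote` are real standard-library calls.

```python
from urllib.parse import quote, unquote


def naive_escape(name, unsafe="/\0"):
    """Percent-escape only the (ASCII) characters listed in `unsafe`."""
    return "".join(f"%{ord(c):02X}" if c in unsafe else c for c in name)


def escape_with_percent(name, unsafe="/\0"):
    """Same scheme, but also escape "%" itself so the mapping is reversible."""
    return naive_escape(name, unsafe + "%")


if __name__ == "__main__":
    # Problem 1: if "%" is left alone, two different inputs collide.
    print(naive_escape("a/b"), naive_escape("a%2Fb"))

    # Escaping "%" as well keeps the two inputs distinct and reversible.
    first = escape_with_percent("a/b")
    second = escape_with_percent("a%2Fb")
    print(first, second, unquote(second))

    # Problem 2: the escaped filename has to be escaped again before it
    # can appear in a file: URL, because its "%" is now a literal character.
    print("file:///tmp/" + quote(first))
```

The collision in the first case is what makes a scheme that leaves % unescaped easy to break: a user who types a literal `a%2Fb` ends up on the same file as one who typed `a/b`.
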