[Python-ideas] Re: Sanitize filename (path part)

Andrew Barnert via Python-ideas Sat, 09 May 2020 22:36:08 -0700

On May 9, 2020, at 17:35, Steve Jorgensen <ste...@stevej.name> wrote:
> 
> I believe the Python standard library should include a means of sanitizing a 
> filesystem entry, and this should not be something requiring a 3rd party 
> package.
> 
> One of the reasons I think this should be in the standard lib is because that 
> provides a common, simple means for code reviewers and static analysis 
> services such as Veracode to recognize that a value is sanitized in an 
> accepted manner.


This does seem like a good idea. People who do this themselves get it wrong all 
the time, occasionally with disastrous consequences, so if Python can solve 
that, that would be great.

But, at least historically, this has been more complicated than what you’re 
suggesting here. For example, don’t you have to catch things like directories 
named “Con” or files whose 8.3 representation has “CON” as the 8 part? I don’t 
think you can hang an entire Windows system by abusing those anymore, but you 
can still produce filenames that some APIs, and some tools (possibly including 
Explorer, cmd, powershell, Cygwin, mingw/native shells, Python itself…) can’t 
access (or can only access if the user manually specified a \\.\ absolute path, 
or whatever).
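
To make the reserved-name problem concrete, here is a rough sketch of the kind of check involved. The helper name looks_reserved_on_windows is hypothetical (it is not an existing stdlib function), the device-name list is the one Microsoft documents in its file-naming guidance, and the sketch deliberately ignores the 8.3 short-name case mentioned above:

```python
# Hypothetical helper for illustration; not part of the standard library.
# Device names as listed in Microsoft's file-naming documentation.
_RESERVED = frozenset(
    ["CON", "PRN", "AUX", "NUL"]
    + [f"COM{i}" for i in range(1, 10)]
    + [f"LPT{i}" for i in range(1, 10)]
)


def looks_reserved_on_windows(component: str) -> bool:
    """Return True if one path component collides with a DOS device name."""
    # Assumption: trailing dots/spaces are ignored and only the part before
    # the first dot matters, so "con.txt" and "NUL." both count.
    stem = component.rstrip(". ").split(".", 1)[0].rstrip(" ")
    return stem.upper() in _RESERVED


print(looks_reserved_on_windows("Con"))      # a directory named "Con"
print(looks_reserved_on_windows("console"))  # merely starts with "con"
```

Even this small version has to make several judgment calls, which is part of why a vetted, widely trusted rule would be preferable to an ad hoc one.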

Is there an established algorithm/rule that lots of people in the industry 
trust that Python can just reference, instead of having to research or invent 
it? Because otherwise, we run the risk of making things worse instead of better.

> What I am envisioning is a function (presumably in `os.path` with a signature 
> roughly like
> {{{
> sanitizepart(name, permissive=False, mode=ESCAPE, system=None)
> }}}

Maybe it would make more sense to put this in pathlib. Then you construct a 
PurePath of the appropriate type, and call sanitize() on it (maybe with a flag 
that ensures that it’s a single path component if you expected it to be one).

I think some, but not all, of this logic already exists in pathlib.
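
As an illustration of what pathlib already offers, the pure path classes will split a string according to each platform's separator rules. The function is_single_component below is a hypothetical name for the "single path component" flag suggested above; it is only a sketch, not a proposed API:

```python
from pathlib import PurePosixPath, PureWindowsPath


def is_single_component(name, path_type=PurePosixPath):
    """Return True if `name` parses as exactly one path component.

    Hypothetical helper: `path_type` selects the platform rules to apply.
    """
    p = path_type(name)
    # Comparing p.name with the input rejects roots, drives and
    # trailing separators, which pathlib would otherwise normalize away.
    return len(p.parts) == 1 and p.name == name


print(is_single_component("report.txt"))            # one component
print(is_single_component("a\\b"))                  # backslash is ordinary on POSIX
print(is_single_component("a\\b", PureWindowsPath)) # but a separator on Windows
```

Note that ".." is also a single component by this test, so it would need separate handling.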

> When `permissive` is `False`, characters that are generally unsafe are 
> rejected. When `permissive` is `True`, only path separator characters are 
> rejected. Generally unsafe characters besides path separators would include 
> things like a leading ".", any non-printing character, any wildcard, piping 
> and redirection characters, etc.

I think neither of these is what I’d usually want.

I never want to sanitize just pathsep characters without sanitizing all illegal 
characters.

I do often want to sanitize all illegal characters (just \0 and the path sep on 
POSIX, a larger set that I don’t know by heart on Windows).
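
For reference, a minimal sketch of the "illegal characters only" check. The POSIX set is the one named above; the Windows set is taken from Microsoft's file-naming documentation (the punctuation shown plus control characters 0 through 31). The names illegal_chars, POSIX_ILLEGAL and WINDOWS_ILLEGAL are made up for this example:

```python
# Characters a filename component may never contain, per platform.
POSIX_ILLEGAL = frozenset("\0/")
WINDOWS_ILLEGAL = frozenset('<>:"/\\|?*') | frozenset(chr(c) for c in range(32))


def illegal_chars(component, system="posix"):
    """Return the illegal characters found in `component`, in order.

    `system` is a hypothetical selector: "windows" or anything else for POSIX.
    """
    bad = WINDOWS_ILLEGAL if system == "windows" else POSIX_ILLEGAL
    return [ch for ch in component if ch in bad]


print(illegal_chars("a:b?"))             # fine on POSIX
print(illegal_chars("a:b?", "windows"))  # two offenders on Windows
```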

I don’t think I’ve ever wanted to sanitize the set of potentially-unsafe 
characters you’re proposing here.

I have wanted to sanitize (or pop up an “are you sure?” dialog, etc.) a wider 
range of potentially confusing characters. For example, newlines or Unicode 
separators can be very confusing in filenames. I’ve used one of those 
“potentially misleading URL” libs for this even though files and URLs aren’t 
quite the same and it was definitely overzealous, but if I’m not really 
confident that someone has thought through the details and widely vetted them, 
I’d rather have overzealous than underzealous for something like this.
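
One possible starting point for the "potentially confusing" check is Unicode general categories, via the stdlib unicodedata module. The category list below is an assumption chosen for illustration (controls, format characters, line and paragraph separators, and space separators other than a plain space); it has not been vetted the way such a rule would need to be:

```python
import unicodedata

# Assumed set of suspicious categories: control, format, line/paragraph separator.
_SUSPECT = {"Cc", "Cf", "Zl", "Zp"}


def confusing_chars(name):
    """Return characters in `name` that are likely to mislead a human reader."""
    found = []
    for ch in name:
        cat = unicodedata.category(ch)
        # Zs covers exotic spaces such as NO-BREAK SPACE; allow only U+0020.
        if cat in _SUSPECT or (cat == "Zs" and ch != " "):
            found.append(ch)
    return found


print(confusing_chars("notes\n.txt"))      # embedded newline
print(confusing_chars("plain name.txt"))   # nothing flagged
```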

Meanwhile, on POSIX, it’s actually bytes rather than characters that are 
illegal. Any character that, in the filesystem’s encoding, would have a \0 or 
\x2f is therefore illegal. Of course in UTF-8, the only such characters are NUL 
and /, so in scripts I write for my own use on my own systems where I know all 
the filesystems are UTF-8 I don’t worry about this. But something meant for 
hardening/verification tools seems like it needs to meet a higher standard and 
work on more varied systems. And I don’t know how you could even apply the 
right rule without knowing what the file system encoding is (which means you 
need the full path, not just the component to be checked) or requiring bytes 
rather than str (but then it doesn’t work for Windows, and resolving that whole 
mess gets extra fun, and even on POSIX it’s a lot less common to use).
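
The byte-level rule can be sketched as follows. The function name and the explicit fs_encoding parameter are assumptions for illustration; in practice the hard part is exactly that the caller may not know the right encoding for the target filesystem. UTF-16-LE is used below only to show the effect, not because it is a realistic POSIX filesystem encoding:

```python
def has_illegal_posix_bytes(name: str, fs_encoding: str) -> bool:
    """Return True if `name`, once encoded, contains a NUL or 0x2F byte."""
    # surrogateescape mirrors how Python round-trips undecodable filenames.
    raw = name.encode(fs_encoding, "surrogateescape")
    return b"\x00" in raw or b"\x2f" in raw


# U+2F00 is neither NUL nor "/", yet its UTF-16-LE bytes are 00 2F.
print(has_illegal_posix_bytes("\u2f00", "utf-8"))
print(has_illegal_posix_bytes("\u2f00", "utf-16-le"))
```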

Speaking of encodings and Windows, isn’t any character not in the user’s OEM 
code page likely to be confusing? Sure, it’ll work with other Python 3.8 
scripts, but it’ll crash or do the wrong thing or display mojibake when used 
with lots of other tools.
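
A rough way to test for this, assuming the relevant code page is known and passed in by the caller (cp437 is just an example default, and survives_code_page is a hypothetical name):

```python
def survives_code_page(name, code_page="cp437"):
    """Return True if every character of `name` exists in `code_page`."""
    try:
        name.encode(code_page)
    except UnicodeEncodeError:
        return False
    return True


print(survives_code_page("r\u00e9sum\u00e9.txt"))  # accented Latin letters
print(survives_code_page("\u65e5\u672c.txt"))      # CJK characters
```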

> The `mode` argument indicates what to do with unacceptable characters. Escape 
> them (`ESCAPE`), omit them (`OMIT`) or raise an exception (`RAISE`).

What’s the exception, and what attributes does it have? Usually I don’t care 
too much as long as the traceback/log entry/whatever is good enough for 
debugging, but for this function, I think I’d often want to be able to 
programmatically access the character(s) that triggered the error so I can tell 
the user. Especially if the rule isn’t a fixed, well-known one that you can 
describe the way Windows Explorer does when you try to use an illegal character.
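
As a sketch of what programmatic access could look like, the exception below carries the original name and the offending characters with their positions. Both UnsafeFilenameError and check_component are hypothetical names invented for this example, and the default illegal set is just the POSIX one:

```python
class UnsafeFilenameError(ValueError):
    """Hypothetical exception exposing what was wrong with a filename."""

    def __init__(self, name, offending):
        self.name = name
        self.offending = tuple(offending)  # (index, character) pairs
        chars = ", ".join(repr(ch) for _, ch in self.offending)
        super().__init__(f"{name!r} contains disallowed character(s): {chars}")


def check_component(name, illegal=frozenset("\0/")):
    """Return `name` unchanged, or raise UnsafeFilenameError."""
    offending = [(i, ch) for i, ch in enumerate(name) if ch in illegal]
    if offending:
        raise UnsafeFilenameError(name, offending)
    return name


try:
    check_component("a/b/c")
except UnsafeFilenameError as exc:
    # A UI could highlight these positions for the user.
    print(exc.offending)
```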

> This could also double as an escape character argument when a string is 
> given. The default escape character should probably be "%" (same as URL 
> encoding).

But % only makes sense with a specific encoding of the escaped character, which 
is a totally different encoding than the one used by other escape mechanisms, 
so how can just an escape string select between them? If I give it \U expecting 
to get JSON escapes but instead get % escapes with \U in place of %, that won’t 
be at all useful. In fact, passing any string at all besides % won’t be at all 
useful, because I don’t think there’s any other escape mechanism with the same 
rules as %-encoding but a different escape character. 

Not only that, but %-encoding doesn’t make sense with a different list of 
characters to be encoded than the one actually used by URLs, most obviously 
because % itself is not an unsafe filename character but you’d better be 
escaping it anyway, or the system is ridiculously easy to break/exploit.
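
The collision is easy to demonstrate. The percent_escape function below is a throwaway sketch (not the proposed API) that %-encodes the UTF-8 bytes of any character in a caller-supplied unsafe set:

```python
def percent_escape(name, unsafe):
    """%-encode the UTF-8 bytes of every character of `name` found in `unsafe`."""
    out = []
    for ch in name:
        if ch in unsafe:
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


NAIVE = frozenset("\0/")   # only the characters that are actually illegal
SAFER = NAIVE | {"%"}      # also escape the escape character itself

# With the naive set, two different inputs map to the same filename.
print(percent_escape("a/b", NAIVE) == percent_escape("a%2Fb", NAIVE))
# Escaping "%" as well keeps them distinct.
print(percent_escape("a/b", SAFER), percent_escape("a%2Fb", SAFER))
```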

More importantly, what would %-encoding be good for? No other program—including 
Finder/Explorer, native GUI apps, native shell tools, etc.—will know how to 
generate the same name from the same user input, much less how to convert it 
back to something human-readable, etc. Even browsers following file: URLs won’t 
be able to use these names, and in fact it’ll be pretty confusing that (a) it’s 
misleadingly close to URL escaping but not the same, and (b) you have to 
%-escape the %-escaped filename to actually get a usable file URL out of it.
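
Point (b) can be seen with urllib.parse from the standard library. The path /tmp/a%2Fb is a made-up example of a file whose on-disk name already contains a %-escape, and file_url_for is a hypothetical helper:

```python
from urllib.parse import quote, unquote


def file_url_for(path):
    """Build a file: URL for an absolute POSIX path (illustrative helper)."""
    # quote() escapes "%" as "%25" and leaves "/" alone by default.
    return "file://" + quote(path)


on_disk = "/tmp/a%2Fb"  # the name literally contains "%2F"
print(file_url_for(on_disk))
# Pasting the name into a URL unescaped points at a different path.
print(unquote("/tmp/a%2Fb"))
```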
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/ARG2UAFQGG5JD4FLGNH5NFAJYCE7PI62/
Code of Conduct: http://python.org/psf/codeofconduct/
