[Python-ideas] Re: Sanitize filename (path part)

Andrew Barnert via Python-ideas Tue, 12 May 2020 12:04:19 -0700

On May 12, 2020, at 01:32, Barry Scott <ba...@barrys-emacs.org> wrote:
> 
> 
>> On 11 May 2020, at 23:24, Andrew Barnert <abarn...@yahoo.com> wrote:
>> 
>>> On May 11, 2020, at 13:31, Barry Scott <ba...@barrys-emacs.org> wrote:
>>> 
>>> macOS and Unix version (I only use Unicode input so avoid the random bytes 
>>> problems):
>> 
>> But that doesn’t avoid the problem. If someone gives you a character whose 
>> encoding on the target filesystem includes a null or pathsep byte, your 
>> sanitizer will pass it as safe, when it shouldn’t.
> 
> Do you have an example that shows an encoding that produces a NUL or pathsep? 
> I'm not aware of any.


UTF-1 encodes U+D7FF to the bytes F7 2F C3. BOCU has similar examples. In the 
other direction, MUTF-8 decodes the bytes C0 80 to U+0000. There were a number 
of cross-site scripting and misleading-link attacks abusing (mostly) BOCU in 
this way, which is part of the reason WHATWG banned them as charsets. There 
were other reasons too (they banned stuff like SCSU and CESU-8 and UTF-7 at 
the same time, and I don’t think any of them have the same problem). And if 
there were widespread legitimate uses of these codecs, they probably wouldn’t 
have been banned (see UTF-16LE, which is even easier to exploit this way, but 
is unfortunately way too common to ban).
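
A minimal sketch of the point, assuming the sanitizer knows the target 
filesystem encoding: check the encoded bytes, not just the characters. The 
helper name is_safe_after_encoding is made up for illustration, and UTF-16LE 
stands in for the more exotic codecs only because Python ships it.

```python
def is_safe_after_encoding(name: str, fs_encoding: str) -> bool:
    """Return False if the encoded name contains a NUL or '/' byte."""
    try:
        raw = name.encode(fs_encoding)
    except UnicodeEncodeError:
        # Not representable in the target encoding at all.
        return False
    return b"\x00" not in raw and b"/" not in raw


# U+2F00 is an ordinary CJK radical, so a character-level filter accepts it,
# but its UTF-16LE encoding is the two bytes 00 2F.
name = "\u2f00"
print("/" in name or "\x00" in name)               # False
print(name.encode("utf-16-le"))                    # b'\x00/'
print(is_safe_after_encoding(name, "utf-16-le"))   # False
print(is_safe_after_encoding(name, "utf-8"))       # True

# The reverse direction: Python's strict UTF-8 decoder rejects the
# overlong sequence C0 80 instead of decoding it to U+0000 as MUTF-8 does.
try:
    b"\xc0\x80".decode("utf-8")
except UnicodeDecodeError:
    print("C0 80 rejected by strict UTF-8")
```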

I don’t think Python comes with codecs for any of these encodings. And I don’t 
know of anyone who ever used them for filenames. (SCSU was the default fs 
encoding on Symbian flash memory drives, but again, I don’t think it has this 
problem.) So this may well not be a practical problem.
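
One way to check the claim about bundled codecs on a given interpreter is to 
ask the codec registry directly. This is only a sketch: the names passed in 
are guesses at what such codecs would be called, has_codec is a hypothetical 
helper, and a third-party package could register any of them.

```python
import codecs


def has_codec(name: str) -> bool:
    """Return True if this interpreter can find a codec under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


# Candidate names are assumptions, not official registry names.
for candidate in ("utf-1", "bocu-1", "scsu", "cesu-8"):
    print(candidate, "available" if has_codec(candidate) else "not found")
```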

>> Is it still a realistic problem today? I don’t know. I’m pretty sure the 
>> modern versions of Shift-JIS, EUC-*, Big5, and GB can never have 
>> continuation bytes below 0x30, but even if I’m right, are these (and UTF-8, 
>> of course) the only multi-byte encodings anyone ever uses on Unix 
>> filesystems?
> 
> I suspect that legacy encodings are used in organisations with old data, but 
> I don't have direct experience of this.

I have direct experience of some of those East Asian codecs, albeit 15 or so 
years ago. I’m pretty sure the only ones they used were all safe.
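
Rather than rely on memory, the "all safe" guess can be tested by brute force 
against whatever codecs Python ships. The sketch below is hedged accordingly: 
unsafe_chars is a hypothetical helper, it encodes one character at a time (so 
it says nothing useful about stateful codecs such as the ISO-2022 family), 
it only scans the BMP by default to stay quick, and Python's codec tables may 
differ from what a given filesystem driver actually uses.

```python
def unsafe_chars(encoding: str, limit: int = 0x10000) -> list:
    """Characters other than NUL and '/' whose encoding has a 00 or 2F byte.

    Scans code points below `limit`; pass sys.maxunicode + 1 for everything.
    """
    found = []
    for cp in range(limit):
        ch = chr(cp)
        if ch in ("\x00", "/"):
            continue  # these are caught by a character-level filter anyway
        try:
            raw = ch.encode(encoding)
        except UnicodeEncodeError:
            continue  # not representable in this codec
        if b"\x00" in raw or b"/" in raw:
            found.append(ch)
    return found


for enc in ("shift_jis", "euc_jp", "euc_kr", "big5", "gb18030", "utf-8"):
    print(enc, len(unsafe_chars(enc)))
```

A count of zero for a codec means no single BMP character smuggles in a NUL 
or pathsep byte under that codec; anything else is a list of characters a 
character-level sanitizer would wrongly pass.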

I also have experience even further back of mounting drives from Ataris and 
classic Macs and IBM mainframes and all kinds of other crazy things under Unix, 
but the filesystem drivers recoded filenames on the fly, along with providing a 
Unix-style hierarchical filesystem, so user-level code didn’t have to worry 
about MacKorean or EBCDIC or whatever any more than it had to worry about : as 
a pathsep and absolute paths being the ones that _don’t_ start with a pathsep 
and so on.

So, based on my experience, it doesn’t seem likely to come up even in shops 
full of old data. But that experience isn’t worth much…


_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/7L466KEUYZ3ZA2IUBUD2L7UONQFPSECM/
Code of Conduct: http://python.org/psf/codeofconduct/
