On May 12, 2020, at 01:32, Barry Scott <ba...@barrys-emacs.org> wrote:
>
>> On 11 May 2020, at 23:24, Andrew Barnert <abarn...@yahoo.com> wrote:
>>
>>> On May 11, 2020, at 13:31, Barry Scott <ba...@barrys-emacs.org> wrote:
>>>
>>> macOS and Unix version (I only use Unicode input so avoid the random bytes
>>> problems):
>>
>> But that doesn’t avoid the problem. If someone gives you a character whose
>> encoding on the target filesystem includes a null or pathsep byte, your
>> sanitizer will pass it as safe, when it shouldn’t.
>
> Do you have a example that shows an encoding that produces a NUL or pathsep?
> I'm not aware of any.

UTF-1 encodes U+D7FF to the bytes F7 2F C3, and 2F is the ASCII byte for "/". BOCU has similar examples. In the other direction, MUTF-8 decodes the bytes C0 80 to U+0000, so a NUL character can arrive without any 00 byte in the input.

There were a number of cross-site scripting and misleading-link attacks abusing (mostly) BOCU in this way, which is part of the reason WHATWG banned them as charsets. There were other reasons too: they banned SCSU, CESU-8, and UTF-7 at the same time, and I don’t think any of those have the same problem. And if there were widespread legitimate uses of these codecs, they probably wouldn’t have been banned (see UTF-16LE, which is even easier to exploit this way, but unfortunately way too common).

I don’t think Python comes with codecs for any of these encodings. And I don’t know of anyone who ever used them for filenames. (SCSU was the default fs encoding on Symbian flash memory drives, but again, I don’t think it has this problem.) So this may well not be a practical problem.

>> Is it still a realistic problem today? I don’t know. I’m pretty sure the
>> modern versions of Shift-JIS, EUC-*, Big5, and GB can never have
>> continuation bytes below 0x30, but even if I’m right, are these (and UTF-8,
>> of course) the only multi-byte encodings anyone ever uses on Unix
>> filesystems?
>
> I suspect that legacy encoding are used in organisations with old data, but
> do have direct experience of this.

I have direct experience of some of those East Asian codecs, albeit 15 or so years ago. I’m pretty sure the only ones they used were all safe.

I also have experience even further back of mounting drives from Ataris and classic Macs and IBM mainframes and all kinds of other crazy things under Unix. But the filesystem drivers recoded filenames on the fly, along with providing a Unix-style hierarchical filesystem. So user-level code didn’t have to worry about MacKorean or EBCDIC or whatever, any more than it had to worry about : as a pathsep, or absolute paths being the ones that _don’t_ start with a pathsep, and so on.

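The byte-level examples above can be checked with a short sketch. Python ships no UTF-1 codec, so `utf1_encode_char` below is a hypothetical hand-written encoder based on the published UTF-1 algorithm as I understand it, limited to code points up to U+38E2D. The helper names `utf1` and `smuggled_bytes` are also made up for illustration. Treat it as a demonstration of the idea, not a vetted codec or a complete sanitizer.

```python
BE = 190  # number of byte values UTF-1 permits in trailing positions


def _trail(z: int) -> int:
    """Map a residue 0..189 to a trailing byte (0x21-0x7E or 0xA0-0xFF)."""
    return z + 0x21 if z < 0x5E else z + 0x42


def utf1_encode_char(cp: int) -> bytes:
    """Encode one code point; assumed reading of the UTF-1 rules."""
    if cp < 0xA0:
        return bytes([cp])
    if cp < 0x100:
        return bytes([0xA0, cp])
    if cp < 0x4016:
        y = cp - 0x100
        return bytes([0xA1 + y // BE, _trail(y % BE)])
    if cp < 0x38E2E:
        y = cp - 0x4016
        return bytes([0xF6 + y // (BE * BE),
                      _trail(y // BE % BE),
                      _trail(y % BE)])
    raise ValueError("five-byte forms are omitted from this sketch")


def utf1(name: str) -> bytes:
    return b"".join(utf1_encode_char(ord(c)) for c in name)


def smuggled_bytes(name: str, encode) -> set:
    """Return which of the byte values 0x00 and 0x2F appear in the
    encoded name even though the text has no NUL or '/' character."""
    if "\0" in name or "/" in name:
        raise ValueError("name already contains NUL or '/'")
    return set(encode(name)) & {0x00, 0x2F}


print(utf1("\ud7ff").hex())
print(smuggled_bytes("\ud7ff", utf1))
print(smuggled_bytes("A", lambda s: s.encode("utf-16-le")))
print(smuggled_bytes("\ud7ff", lambda s: s.encode("utf-8")))
```

A sanitizer that inspects only the decoded text would accept both "\ud7ff" and "A", yet the first carries a "/" byte under UTF-1 and the second a NUL byte under UTF-16LE. Under UTF-8 neither byte can appear unless the text itself contains that character.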
So, based on my experience, it doesn’t seem likely to come up even in shops full of old data. But that experience isn’t worth much…

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
Message archived at https://mail.python.org/archives/list/python-ideas@python.org/message/7L466KEUYZ3ZA2IUBUD2L7UONQFPSECM/