On Sun, Dec 7, 2008 at 11:04 PM, Glenn Linderman <[EMAIL PROTECTED]> wrote:
> On approximately 12/7/2008 9:11 PM, came the following characters from
> the keyboard of Adam Olsen:
>> On Sun, Dec 7, 2008 at 9:45 PM, Glenn Linderman <[EMAIL PROTECTED]>
>> wrote:
>
> Once upon a time I did write an unvalidated UTF-8 encoder/decoder in C; I
> wonder if I could find that code? Can you supply a validated decoder?
> Then we could run some benchmarks, eh?
There is no point for me, as the behaviour of a real UTF-8 codec is clear.
It is you who needs to justify a second, non-standard UTF-8-ish codec. See
below.

>>> You didn't address the issue that if the decoding to a canonical form
>>> is done first, many of the insecurities just go away, so why throw
>>> errors?
>>
>> Unicode is intended to allow interaction between various bits of
>> software. It may be that a library checked it in UTF-8, then passed
>> it to python. It would be nice if the library validated too, but a
>> major advantage of UTF-8 is that older libraries (or protocols!)
>> intended for ASCII need only be 8-bit clean to be repurposed for UTF-8.
>> Their security checks continue to work, so long as nobody downstream
>> introduces problems with a non-validating decoder.
>
> So I don't understand how this is responsive to the "decoding removes
> many insecurities" issue?
>
> Yes, you might use libraries. Either they have insecurities, or not.
> Either they validate, or not. Either they decode, or not. They may be
> immune to certain attacks, because of their structure and code, or not.
>
> So when you examine a library for potential use, you have documentation
> or code to help you set your expectations about what it does, whether or
> not it may have vulnerabilities, whether those vulnerabilities are likely
> or unlikely, and whether you can reduce their likelihood or prevent them
> by wrapping the API, etc. And so you choose to use the library, or not.
>
> This whole discussion about libraries seems somewhat irrelevant to the
> question at hand, although it is certainly true that understanding how a
> library handles Unicode is an important issue for the potential user of a
> library.
>
> So how does a non-validating decoder introduce problems? I can see that
> it might not solve all problems, but how does it introduce problems?
> Wouldn't the problems be introduced by something else, and the use of a
> non-validating decoder may not catch the problem... but not be the cause
> of the problem?
>
> And then, if you would like to address the original issue, that would be
> fine too.

Your non-validating encoder is translating an invalid sequence into a
valid one, thus you are introducing the problem. A completely naive
environment (8-bit clean ASCII) would leave it as an invalid sequence
throughout.

This is not a theoretical problem. See
http://tools.ietf.org/html/rfc3629#section-10 . We MUST reject invalid
sequences, or else we are not using UTF-8. There is no wiggle room, no
debate.

(The absoluteness is why the standard behaviour doesn't need a benchmark.
You are essentially arguing that, when logging in as root over the
internet, it's a lot faster if you use telnet rather than ssh. One is
simply not an option.)

--
Adam Olsen, aka Rhamphoryncus

_______________________________________________
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
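[Editorial note: the overlong-sequence attack that RFC 3629 section 10
warns about can be sketched concretely. This is an illustration only;
`lenient_decode_2byte` is a hypothetical non-validating decoder written
for this example, not anything from Python's codecs. A byte-level security
filter never sees "../" in the attack bytes, the lenient decoder silently
turns them into a valid "../", and a real validating UTF-8 decoder rejects
them outright.]

```python
def lenient_decode_2byte(data: bytes) -> str:
    """Hypothetical NON-validating decoder: it accepts overlong 2-byte
    sequences instead of rejecting them, which RFC 3629 forbids."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                                  # plain ASCII byte
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data):  # 2-byte lead byte
            # No check that the value actually needed two bytes: this is
            # exactly the overlong-acceptance bug being demonstrated.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("sequence form not handled in this sketch")
    return "".join(out)

attack = b"\xc0\xae\xc0\xae\xc0\xaf"   # overlong encoding of "../"

# An 8-bit-clean byte-level filter looking for "../" sees nothing:
assert b"../" not in attack

# The non-validating decoder converts the invalid bytes into a valid,
# dangerous string -- introducing the problem downstream:
assert lenient_decode_2byte(attack) == "../"

# A real, validating UTF-8 decoder rejects the overlong form outright:
try:
    attack.decode("utf-8")
except UnicodeDecodeError:
    pass  # correct behaviour per RFC 3629
else:
    raise AssertionError("overlong sequence should have been rejected")
```

This is why "decode first, then check" is only safe with a validating
decoder: the byte-level checks that worked in an ASCII world are bypassed
once invalid sequences are laundered into valid characters.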