On Friday, February 07, 2014 21:04:08 Jonathan M Davis wrote:
> On Saturday, February 08, 2014 05:29:35 Marco Leise wrote:
> > I guess we just have two use cases here. One where invalid
> > encoding is not an error (e.g. for sanitizing purposes) and
> > one where you don't want to lose information and have to
> > enforce correct encoding.
> > Name the first one "decodeSubst" maybe and have decode call
> > that and check for 0xFFFD?
>
> I think that that would call for us to have 3 related but distinct
> functions:
>
> 1. decode, which throws on invalid Unicode. We already have this.
>
> 2. isValidUnicode, which returns whether the string is valid Unicode and
> does not throw. We don't yet have this. Rather, we have validate which does
> the same job and then throws instead of returning bool.
>
> 3. sanitizeUnicode (or whatever would be a good name for it), which replaces
> invalid Unicode with 0xFFFD (or whatever the appropriate character is) so
> that it can be operated on without causing decode to throw in spite of the
> fact that it was invalid Unicode. We don't have anything like this yet.
Actually, thinking this through some more, if we can replace invalid Unicode with 0xFFFD, and have all algorithms work with that and consider it valid Unicode (rather than getting weird bugs due to invalid Unicode), then if decode returned that on error rather than throwing, we wouldn't actually need to check the return value. It wouldn't matter that the Unicode was invalid, so we wouldn't even need to _care_ that it was. Anyone who _did_ care could call isValidUnicode to validate the Unicode first, and those who didn't wouldn't need to worry about UTFException being thrown, because everything would still work even if the string was invalid Unicode.

So, if that's indeed what 0xFFFD does, and that's what Dmitry meant by proposing that we return that rather than throwing, then I rescind my assessment that throwing was the best way to go and have to agree that returning 0xFFFD would be better. I was responding under the assumption that you had to check for 0xFFFD and respond to it in order to avoid having your code be buggy, in which case throwing would be far better. But if 0xFFFD is considered valid Unicode, then returning that would be a fantastic solution.

And if that's the case, we only need two functions, not three:

1. decode, which returns 0xFFFD on decode failure

2. isValidUnicode, which returns whether the string is valid

And I actually really like the idea that we could just operate on invalid Unicode as valid Unicode this way, making it so that most code doesn't need to care, and code that _does_ need to care can validate the strings first. Right now, pretty much all string code needs to care in order to avoid processing invalid Unicode, which is much messier.

- Jonathan M Davis
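
To make the two-function design described above concrete, here is a minimal sketch. It is written in Python rather than D so that it runs stand-alone, and it is only an illustration of the idea, not Phobos code. The names `decode`, `is_valid_unicode`, and `count_letters` are hypothetical, and the sketch decodes a whole UTF-8 byte string at once instead of one code point at a time. It relies on Python's built-in `errors="replace"` handler, which substitutes U+FFFD (the Unicode replacement character) for ill-formed input.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD, the Unicode replacement character


def decode(data: bytes) -> str:
    """Lenient decode: never raises; ill-formed UTF-8 becomes U+FFFD."""
    return data.decode("utf-8", errors="replace")


def is_valid_unicode(data: bytes) -> bool:
    """Strict check: reports validity as a bool instead of raising."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def count_letters(data: bytes) -> int:
    """Example of code that does not care whether the input was valid."""
    return sum(ch.isalpha() for ch in decode(data))


good = "héllo".encode("utf-8")
bad = b"he\xffllo"  # 0xFF can never appear in well-formed UTF-8

# Code that does not care just decodes and carries on.
print(count_letters(good), count_letters(bad))

# Code that does care validates first.
for sample in (good, bad):
    if is_valid_unicode(sample):
        print("valid:", decode(sample))
    else:
        print("rejected: input is not valid UTF-8")
```

The point of the split is visible in `count_letters`: it needs no error handling at all, because the replacement character is an ordinary code point as far as the rest of the code is concerned. Only callers that must not lose information pay for the extra validation pass.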
