On Sat, 08 Feb 2014 15:21:26 +0400, Dmitry Olshansky wrote:
> 08-Feb-2014 02:57, Jonathan M Davis wrote:
> > On Friday, February 07, 2014 20:43:38 Dmitry Olshansky wrote:
> >> 07-Feb-2014 20:29, Andrej Mitrovic wrote:
> >>> On Friday, 7 February 2014 at 16:27:35 UTC, Andrei Alexandrescu wrote:
> >>>> Add a bugzilla and let's define isValid that returns bool!
> >>>
> >>> Add std.utf.decode() to that as well. IOW, it should have an overload
> >>> which returns a status code
> >>
> >> Much simpler - it returns a special dchar to designate bad encoding. And
> >> there is one defined by Unicode spec.
> >
> > Isn't that actually worse?
>
> No, it's better and more flexible for those who care to repair broken
> text in case it's broken. We currently have ZERO facilities to work with
> partly broken UTF and it's not that rare a thing to have it.

Your argument is unsubstantiated, since we have this already:
http://dlang.org/phobos/std_encoding.html#.sanitize

> > Unless you're suggesting that we stop throwing on
> > decode errors,
>
> That is exactly what I suggest.
>
> > then functions like std.array.front will have to check the
> > result on every call to see whether it was valid or not and thus whether
> > they should throw, which would mean extra overhead over simply having
> > decode throw on decode errors.
>
> Why the heck? It will not throw either. In the very end bad encoding is
> handled by displaying the 'substituted' (typically '?') character in
> places where it broke, not by throwing up hands in the air and spitting
> "UTF Exception: offset 4302 bad UTF sequence". This is not good enough
> (in case somebody thought that it is).
>
> Those who care about throwing add a trivial map!(x => x != '\uFFFD' ||
> die()) over a string, where the die function throws an exception.

That's neither an improvement over calling "validate", nor does it deal
with distinguishing between invalid UTF and a literal \uFFFD in the input.
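
The last objection can be illustrated with a minimal D sketch (not code from
this thread). It assumes that std.utf.validate throws UTFException on
malformed input and that std.encoding.sanitize substitutes U+FFFD for
malformed UTF-8 sequences; isWellFormed is a hypothetical helper name,
standing in for the bool-returning check proposed in the quoted text.

```d
import std.algorithm : canFind;
import std.encoding : sanitize;
import std.utf : UTFException, validate;

/// Hypothetical helper (not a Phobos function): reports whether `s` is
/// well-formed UTF-8 by wrapping the throwing std.utf.validate.
bool isWellFormed(string s)
{
    try
    {
        validate(s);
        return true;
    }
    catch (UTFException)
    {
        return false;
    }
}

void main()
{
    // Well-formed UTF-8 that legitimately contains U+FFFD.
    string literal = "a\uFFFDb";

    // Malformed input: the byte 0xFF never occurs in well-formed UTF-8.
    immutable(ubyte)[] raw = [0x61, 0xFF, 0x62];
    string broken = cast(string) raw;

    // Validation tells the two inputs apart.
    assert(isWellFormed(literal));
    assert(!isWellFormed(broken));

    // After substitution both strings are well-formed and both contain
    // U+FFFD, so a later check for U+FFFD alone cannot tell which input
    // was originally broken.
    string repaired = sanitize(broken);
    assert(isWellFormed(repaired));
    assert(repaired.canFind('\uFFFD'));
    assert(literal.canFind('\uFFFD'));
}
```

A filter such as map!(x => x != '\uFFFD' || die()) would therefore reject
`literal` as well, even though it was never malformed.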

> > validate has no business throwing, and we definitely should
> > add isValidUnicode (or isValid or whatever you want to call it) for
> > validation purposes. Code can then call that to validate that a string
> > is valid and not worry about any UTFExceptions being thrown as long as
> > it doesn't manipulate the string in a way that could result in its
> > Unicode becoming invalid.
>
> Yet later down the road decode will triple check that anyway. Just
> saying. BTW if the string was checked beforehand there is no difference
> between the 2 approaches at all (don't have to check).
>
> > However, I would argue that assuming that everyone is going to validate
> > their strings and that pretty much all string-related functions
> > shouldn't ever have to worry about invalid Unicode is just begging for
> > subtle bugs all over the place IMHO. You're essentially dealing with
> > error codes at that point, and I think that experience has shown quite
> > clearly that error codes are generally a bad way to go. Almost no one
> > checks them unless they have to. I think that having decode throw on
> > invalid Unicode is exactly what it should be doing. The problem is that
> > validate shouldn't.
>
> Every single text editor out there seems to disagree with you: they do
> show you partially substituted text, not a dialog box "My bad, it's
> broken UTF-8, I'm giving up!".

Editors do different things. They often try to detect the encoding, with a
fallback to Latin-1. If you open a file explicitly as UTF-8, they may
display a substitution character, or detect the error and use the fallback,
as is the case with Geany. gedit, on the other hand, does in fact throw an
error message at you saying "My bad, it's broken UTF-8, I'm giving up!".

-- 
Marco
