17-Feb-2014 06:19, Marco Leise wrote:
On Sun, 09 Feb 2014 12:18:41 +0400,
Dmitry Olshansky <[email protected]> wrote:
09-Feb-2014 09:35, Marco Leise wrote:
That's neither an improvement over calling "validate" nor does
that deal with distinguishing between invalid UTF and
Means text is broken but wasn't ever read...
\uFFFD
in the input.
...means text was broken sometime before.
Hardly makes any difference to most applications.
Normal text doesn't contain \uFFFD.
Of course it does. It is a valid symbol and a lot of websites
describing the "Specials" Unicode block make use of it, like
the one on Wikipedia:
http://en.wikipedia.org/wiki/Specials_(Unicode_block)
With your definition, pulling such a document from the web and
parsing it in D would mean playing on broken strings.
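
To make the ambiguity concrete, here is a minimal sketch. It uses Python's built-in UTF-8 codec purely as an illustration (it is not D's std.utf), and the sample strings are made up: once a decoder substitutes U+FFFD for a bad byte, the decoded result can be identical to a well-formed document that contains U+FFFD on purpose.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# A well-formed document that legitimately contains U+FFFD,
# e.g. a page describing the "Specials" block.
valid_bytes = "The symbol \ufffd marks a decoding error".encode("utf-8")

# The same text, but with a stray 0xFF byte where the symbol was.
broken_bytes = b"The symbol \xff marks a decoding error"

# Strict decoding succeeds on the valid document.
valid_text = valid_bytes.decode("utf-8")

# Substituting decoding turns the bad byte into U+FFFD.
broken_text = broken_bytes.decode("utf-8", errors="replace")

# From the decoded strings alone, the two inputs can no longer be told apart.
indistinguishable = valid_text == broken_text
```

Only the original bytes still carry the difference between the two cases.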
In a sense, \uFFFD means broken encoding. What about lone surrogates?
Private use symbols that must not occur in transmission? They are all
displayed in various Unicode listings. As for 'playing on broken
strings': ignoring the broken or partially broken parts of a string is,
I think, exactly what most users and use cases want.
A more useful and sensible default for decoding is to substitute on
broken encoding. And it's a standard procedure. It's particularly well
suited to displaying text.
Remember: since it's only a decode, you are still in control of the
original text; in fact, you may re-check what bytes are there IF you want.
The "throw on bad encoding" approach could be useful, but I hardly see it
as what you want by default.
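
A small sketch of the two policies side by side, again in Python for illustration only (the helper names decode_strict and decode_substituting are hypothetical, not Phobos functions). The substituting decode yields text that can be displayed, while the original bytes stay untouched and can be re-inspected at the offset where strict decoding fails.

```python
def decode_strict(data: bytes) -> str:
    """Throwing policy: raises UnicodeDecodeError on the first bad byte."""
    return data.decode("utf-8")


def decode_substituting(data: bytes) -> str:
    """Substituting policy: bad bytes become U+FFFD, decoding never fails."""
    return data.decode("utf-8", errors="replace")


# "café", a stray 0xFF byte, then " ok" (made-up sample input).
raw = b"caf\xc3\xa9 \xff ok"

# Good enough for display: everything but the bad byte is readable.
shown = decode_substituting(raw)

# The original bytes are still there, so the caller can find out
# exactly where and what the problem was, if it cares.
try:
    decode_strict(raw)
    bad_offset = None
except UnicodeDecodeError as err:
    bad_offset = err.start  # byte offset of the offending input

bad_byte = raw[bad_offset] if bad_offset is not None else None
```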
I'm wary of breaking code that relies on throwing. For the moment I
think the best course of action would be to introduce xdecode or some
such that will do substitution on failure, see how it floats and then
change ranges/foreach etc to use xdecode.
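
What such a function might look like can be sketched as follows. This is Python, not D, and only a guess at the intended behaviour: decode one code point at a given index, and on failure return U+FFFD and skip a single byte instead of throwing. The name xdecode comes from the proposal above; the signature and the helper xdecode_all are assumptions made for this example.

```python
def xdecode(data: bytes, index: int) -> tuple:
    """Decode one code point starting at data[index].

    Returns (character, next_index). On malformed input, returns
    U+FFFD and advances by one byte instead of raising.
    """
    for length in range(1, 5):  # UTF-8 sequences are 1 to 4 bytes long
        chunk = data[index:index + length]
        if len(chunk) < length:
            break  # ran off the end: truncated sequence
        try:
            return chunk.decode("utf-8"), index + length
        except UnicodeDecodeError:
            continue  # maybe a longer sequence; try one more byte
    return "\ufffd", index + 1


def xdecode_all(data: bytes) -> str:
    """Iterate over the whole input the way a foreach over code points would."""
    out = []
    index = 0
    while index < len(data):
        ch, index = xdecode(data, index)
        out.append(ch)
    return "".join(out)
```

For example, xdecode_all(b"a\xffb\xc3\xa9") gives "a", U+FFFD, "b", "é" in that order, whereas a throwing decode would stop at the second byte.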
[...]
Every single text editor out there seems to disagree with you: they do
show you partially substituted text, not a dialog box "My bad, it's
broken UTF-8, I'm giving up!".
gedit does in fact throw an error message at you
saying "My bad, it's broken UTF-8, I'm giving up!".
I know, and it's a piece of junk :)
Seriously, it doesn't even have regular expressions for search and replace!
https://yourlogicalfallacyis.com/no-true-scotsman :p
Well, gedit is a nice example of why just throwing an exception is not good
enough for many apps (editors in particular). The fact that it's a piece
of junk might be irrelevant ;)
--
Dmitry Olshansky