17-Feb-2014 06:19, Marco Leise wrote:
On Sun, 09 Feb 2014 12:18:41 +0400,
Dmitry Olshansky <[email protected]> wrote:

09-Feb-2014 09:35, Marco Leise wrote:
That's neither an improvement over calling "validate" nor does
that deal with distinguishing between invalid UTF and

Means text is broken but wasn't ever read...
\uFFFD
in the input.
...means text was broken sometime before.

It hardly makes any difference to most applications.
Normal text doesn't contain \uFFFD.

Of course it does. It is a valid symbol and a lot of websites
describing the "Specials" Unicode block make use of it, like
the one on Wikipedia:
http://en.wikipedia.org/wiki/Specials_(Unicode_block)

With your definition, pulling such a document from the web and
parsing it in D would mean playing on broken strings.

In a sense, \uFFFD means broken encoding. What about lone surrogates? Private-use symbols that must not occur in transmission? They are all displayed in various Unicode listings. As for 'playing on broken strings': ignoring the breakage in broken/partially broken strings is, I think, exactly what most users/use cases want.

A more useful and sensible default for decoding is to substitute on broken encoding. It's standard procedure, and it's particularly better for displaying text.
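
To make the two policies concrete, here is a minimal sketch in Python rather than D, used purely as a language-neutral illustration. The helper names `decode_strict` and `decode_substituting` are made up for this example and are not part of any D or Phobos API; the sample bytes are invented.

```python
# Hypothetical sample: valid UTF-8 for "café", then one stray 0xFF byte.
broken = b"caf\xc3\xa9 \xff ok"

def decode_strict(data: bytes) -> str:
    # "Throw on bad encoding": the whole decode fails on the first bad byte.
    return data.decode("utf-8")

def decode_substituting(data: bytes) -> str:
    # "Substitute on bad encoding": bad bytes become U+FFFD, the rest survives.
    return data.decode("utf-8", errors="replace")

try:
    decode_strict(broken)
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

text = decode_substituting(broken)
```

With the strict policy the caller gets nothing but an error; with substitution the readable parts of the text are still available for display.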

As a reminder: since it's only a decode, you are still in control of the original text; in fact you may re-test what bytes are there IF you want.
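
A small sketch of that re-test, again in Python as an illustration only. The function `classify` and its return strings are invented for this example. It assumes the caller still holds the original bytes, and shows that a literal U+FFFD in valid input can be told apart from a substitution caused by broken encoding.

```python
def classify(data: bytes) -> str:
    # Re-test the original bytes with a strict decode.
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return "invalid encoding"
    if "\ufffd" in text:
        # The bytes EF BF BD are a perfectly valid encoding of U+FFFD.
        return "valid encoding, contains literal U+FFFD"
    return "valid encoding, clean"

literal = b"ok \xef\xbf\xbd"   # U+FFFD really is in the input
damaged = b"ok \xff"           # stray byte, not valid UTF-8

# After a substituting decode the two inputs look identical...
same_text = (literal.decode("utf-8", errors="replace")
             == damaged.decode("utf-8", errors="replace"))
# ...but the original bytes still tell them apart.
kinds = (classify(literal), classify(damaged))
```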

The "throw on bad encoding" approach could be useful, but I hardly see it as what you want by default.

I'm wary of breaking code that relies on throwing. For the moment I think the best course of action would be to introduce xdecode or some such that will do substitution on failure, see how it floats and then change ranges/foreach etc to use xdecode.
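
What such an xdecode might look like, sketched in Python for illustration only. No signature is given in this thread, so the `(data, index) -> (char, next_index)` shape below is an assumption, as is the choice to skip exactly one byte after a failure; `decode_all` is a made-up helper standing in for a range or foreach loop.

```python
from typing import Tuple

REPLACEMENT = "\ufffd"

def xdecode(data: bytes, index: int) -> Tuple[str, int]:
    """Decode one code point at data[index]; never raises on bad input."""
    # A UTF-8 sequence is 1 to 4 bytes long; try the shortest first.
    for length in range(1, 5):
        chunk = data[index:index + length]
        if len(chunk) < length:
            break  # ran off the end of the input
        try:
            return chunk.decode("utf-8"), index + length
        except UnicodeDecodeError:
            continue
    # Assumed policy: substitute and advance by a single byte.
    return REPLACEMENT, index + 1

def decode_all(data: bytes) -> str:
    # Stand-in for a loop that would call xdecode instead of decode.
    out = []
    i = 0
    while i < len(data):
        ch, i = xdecode(data, i)
        out.append(ch)
    return "".join(out)
```

Code that relies on the throwing decode keeps working unchanged, while new code can opt in to substitution by calling the new function.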

[...]
Every single text editor out there seems to disagree with you: they do
show you partially substituted text, not a dialog box "My bad, it's
broken UTF-8, I'm giving up!".

gedit does in fact throw an error message at you
saying "My bad, it's broken UTF-8, I'm giving up!".

I know, and it's a piece of junk :)
Seriously, it doesn't even have regular expressions for search and replace!

https://yourlogicalfallacyis.com/no-true-scotsman :p

Well, gedit is a nice example of why just throwing an exception is not good enough for many apps (editors in particular). The fact that it's a piece of junk might be irrelevant ;)

--
Dmitry Olshansky
