07-Feb-2014 21:07, Andrej Mitrovic wrote:
> On 2/7/14, Dmitry Olshansky <[email protected]> wrote:
>> Much simpler - it returns a special dchar to designate bad encoding. And
>> there is one defined by Unicode spec.
>
> A NaN for chars? Sounds great to me! :)


It's called U+FFFD (REPLACEMENT CHARACTER, written \uFFFD in D) and it exists specifically to stand in for badly encoded input. I wonder why nobody consulted the spec when writing std.utf.decode in the first place...

5.22 Best Practice for U+FFFD Substitution

When converting text from one character encoding to another, a conversion algorithm may encounter unconvertible code units. This is most commonly caused by some sort of corruption of the source data, so that it does not correctly follow the specification for that character encoding. Examples include dropping a byte in a multibyte encoding such as Shift-JIS, improper concatenation of strings, a mismatch between an encoding declaration
and actual encoding of text, use of non-shortest form for UTF-8, and so on.

...

Whenever an unconvertible offset is reached during conversion of a code
unit sequence:
1. The maximal subpart at that offset should be replaced by a single
U+FFFD.
2. The conversion should proceed at the offset immediately after the maximal
subpart.
---
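
To make the two steps above concrete, here is a minimal sketch of a UTF-8 decoder that substitutes U+FFFD for each maximal subpart and then carries on. It is written in Python purely for illustration; it is not std.utf.decode and not a proposed patch. The names decode_utf8 and REPLACEMENT are made up for this example, and the second-byte ranges are the ones the Unicode Standard gives for well-formed UTF-8.

```python
REPLACEMENT = "\uFFFD"  # hypothetical name for the U+FFFD sentinel


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, replacing each ill-formed maximal subpart with U+FFFD.

    Illustrative sketch only: never raises, always makes progress.
    """
    out = []
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                      # ASCII fast path
            out.append(chr(b0))
            i += 1
            continue
        # Pick the number of trailing bytes and the allowed range of the
        # *second* byte; the restricted ranges reject non-shortest forms,
        # surrogates, and values above U+10FFFF.
        if 0xC2 <= b0 <= 0xDF:
            need, cp, lo, hi = 1, b0 & 0x1F, 0x80, 0xBF
        elif 0xE0 <= b0 <= 0xEF:
            need, cp = 2, b0 & 0x0F
            lo = 0xA0 if b0 == 0xE0 else 0x80
            hi = 0x9F if b0 == 0xED else 0xBF
        elif 0xF0 <= b0 <= 0xF4:
            need, cp = 3, b0 & 0x07
            lo = 0x90 if b0 == 0xF0 else 0x80
            hi = 0x8F if b0 == 0xF4 else 0xBF
        else:
            # Stray continuation byte, or a lead byte that can never start
            # a well-formed sequence: a maximal subpart of length one.
            out.append(REPLACEMENT)
            i += 1
            continue
        j = i + 1
        ok = True
        for _ in range(need):
            if j < n and lo <= data[j] <= hi:
                cp = (cp << 6) | (data[j] & 0x3F)
                j += 1
                lo, hi = 0x80, 0xBF        # later bytes use the plain range
            else:
                ok = False                 # data[i:j] is the maximal subpart
                break
        out.append(chr(cp) if ok else REPLACEMENT)
        i = j                              # step 2: resume right after it
    return "".join(out)
```

For the bytes 61 F1 80 80 E1 80 C2 62 80 63 80 BF 64 this yields "a", three U+FFFD, "b", one U+FFFD, "c", two U+FFFD, "d": each truncated sequence collapses into a single replacement character, and each stray continuation byte gets its own.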

Fast, simple and according to the standard. Best of all - no stinkin' exceptions! ;)

--
Dmitry Olshansky
