2017-05-16 12:40 GMT+02:00 Henri Sivonen via Unicode <unicode@unicode.org>:
> > One additional note: the standard codifies this behaviour as a
> > *recommendation*, not a requirement.
>
> This is an odd argument in favor of changing it. If the argument is
> that it's just a recommendation that you don't need to adhere to,
> surely then the people who don't like the current recommendation
> should choose not to adhere to it instead of advocating changing it.

I also agree. The internet is full of RFC specifications that are likewise "best practices", and even in those cases changing them must be extensively documented, including a discussion of new compatibility/interoperability problems and new security risks.

The case of random access into substrings is significant because what was once valid UTF-8 could become invalid if the best-practice recommendation is not followed, and could then cause unexpected failures: uncaught exceptions making software fail suddenly and become subject to possible attacks exploiting this new failure mode. (This is mostly a problem for implementations that do not use "safe" U+FFFD replacements but throw exceptions on ill-formed input: we should not change the cases where these exceptions may occur by adding new cases caused by a change of implementation based on a change of best practice.)

The considerations about trying to reduce the number of U+FFFD replacements are not relevant, but purely aesthetic, because some people would like to compact the decoded result in memory. What is really important is not to silently ignore these ill-formed sequences, and to properly track that there was some data loss. The number of U+FFFD replacements inserted (only one, or as many as there are invalid code units in the input before the first resynchronization point) is not so important.
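To illustrate the point that the replacement count is secondary, here is a minimal sketch in C (the function names `utf8_seq_len` and `count_replacements` are my own, purely for illustration, and the validation is structural only, with no overlong detection): it counts the U+FFFD replacements that the two policies would produce on the same input. Both yield a nonzero count on ill-formed input, so neither policy silently loses the error signal.

```c
#include <stddef.h>
#include <stdint.h>

/* Expected sequence length from a lead byte; 0 if the byte cannot
 * start a sequence (lone trail byte, C0/C1, F5..FF). */
static int utf8_seq_len(uint8_t b)
{
    if (b < 0x80) return 1;
    if (b >= 0xC2 && b < 0xE0) return 2;
    if (b >= 0xE0 && b < 0xF0) return 3;
    if (b >= 0xF0 && b < 0xF5) return 4;
    return 0;
}

/* Count replacements under two policies:
 *  per_run  -- one U+FFFD per maximal invalid run,
 *  per_byte -- one U+FFFD per invalid byte. */
static void count_replacements(const uint8_t *in, size_t len,
                               size_t *per_run, size_t *per_byte)
{
    size_t i = 0;
    *per_run = *per_byte = 0;
    while (i < len) {
        int want = utf8_seq_len(in[i]);
        size_t have = 1;
        /* consume trail bytes up to the expected length */
        while (want && have < (size_t)want && i + have < len &&
               (in[i + have] & 0xC0) == 0x80)
            have++;
        if (want && have == (size_t)want) { i += have; continue; } /* well-formed */
        /* ill-formed: in[i..i+bad) is replaced */
        size_t bad = want ? have : 1;
        *per_run += 1;
        *per_byte += bad;
        i += bad;
    }
}
```

Either count being nonzero is enough to report the data loss to the caller; picking one policy over the other only changes how compact the decoded result is.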
As well, whether implementations will use an accumulator or just a single state (where each state knows how many code units have been parsed without emitting an output code point, so that those code units can be decoded by relative indexed accesses) is not relevant; it is just a very minor optimization case. In my opinion, using an accumulator that can live in a CPU register is faster than using relative indexed accesses: all modern CPUs have enough registers to store that accumulator plus the input and output pointers, and a finite state number is not needed when the state can be tracked by the executable instruction position.

You don't necessarily need to loop for each code unit: you can easily write your decoder so that each loop iteration processes a full code point, or emits a single U+FFFD before adjusting the input pointer. UTF-8 and UTF-16 are simple enough that unrolling such loops to process full code points instead of single code units is easy to implement. That code will still remain very small (fitting fully in the instruction cache), and it will be faster because it avoids several conditional branches and saves one register (for the finite state number) that would otherwise have to be slowly spilled to the stack: two pointer registers (or two access function/method addresses) plus two data registers plus the PC instruction counter are enough.
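The decoder structure described above can be sketched as follows (a hypothetical illustration, not anyone's actual implementation; the function name `utf8_decode` is invented, and replacement counts for some overlong prefixes may differ from the maximal-subpart recommendation). Each loop iteration consumes a full code point or emits one U+FFFD and resynchronizes, so the parse state lives in the instruction position and the accumulator `cp` can stay in a register:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode UTF-8 into code points, substituting U+FFFD for ill-formed
 * input. Returns the number of code points written to `out` (the
 * caller must provide room for up to `len` code points). */
size_t utf8_decode(const uint8_t *in, size_t len, uint32_t *out)
{
    const uint8_t *p = in, *end = in + len;
    size_t n = 0;
    while (p < end) {
        uint8_t b = *p++;
        uint32_t cp, min;
        int more;
        if (b < 0x80)      { out[n++] = b; continue; }      /* ASCII */
        else if (b < 0xC2) { out[n++] = 0xFFFD; continue; } /* lone trail or C0/C1 */
        else if (b < 0xE0) { cp = b & 0x1F; more = 1; min = 0x80; }
        else if (b < 0xF0) { cp = b & 0x0F; more = 2; min = 0x800; }
        else if (b < 0xF5) { cp = b & 0x07; more = 3; min = 0x10000; }
        else               { out[n++] = 0xFFFD; continue; } /* F5..FF */
        while (more--) {
            if (p == end || (*p & 0xC0) != 0x80) {
                cp = 0xFFFD;  /* truncated/bad trail: resync at *p */
                min = 0;
                break;
            }
            cp = (cp << 6) | (*p++ & 0x3F);  /* accumulate 6 bits */
        }
        /* reject overlongs, surrogates, and out-of-range results */
        if (cp < min || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            cp = 0xFFFD;
        out[n++] = cp;
    }
    return n;
}
```

Note that no state variable survives an iteration: on a bad trail byte the input pointer is left at the offending byte, which simply becomes the start of the next iteration, and the only per-sequence values (`cp`, `min`, `more`) are dead by the loop's end, so the register pressure stays at the two pointers plus a couple of data registers, as claimed above.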