On Thu, 18 May 2017 09:58:43 +0100 Alastair Houghton via Unicode <unicode@unicode.org> wrote:

> On 18 May 2017, at 07:18, Henri Sivonen via Unicode
> <unicode@unicode.org> wrote:
> >
> > the decision complicates U+FFFD generation when validating UTF-8 by
> > state machine.
>
> It *really* doesn’t.  Even if you’re hell bent on using a pure state
> machine approach, you need to add maybe two additional error states
> (two-trailing-bytes-to-eat-then-fffd and
> one-trailing-byte-to-eat-then-fffd) on top of the states you already
> have.  The implementation complexity argument is a *total* red
> herring.

For big programs, yes.  However, for a small program it can be
attractive to have a small hand-coded routine so that the source code
can sit in a single file.  It can even allow a basically UTF-8 program
to meet a requirement to be able to match lone surrogates in a regular
expression, as was once required.

Richard.