On 31 May 2016 at 18:43, FRIGN <[email protected]> wrote:
> as a quick note, the sbase libutf is probably the most feature-rich one.
> The version by cls suffers from multiple issues, even though it might
> be the most recent.
Strictly speaking they're all by me, since I started it (and sbase) in
the first place. But there we are.

> I am currently working on a new libutf which is much simpler, much
> more secure (de/encoder) and actually gets the grapheme handling right.

One of the reasons I'm not pushing for any particular solution to the
fragmentation problem is that I'm not sure what libutf should actually
do. There are three distinguishable components in the Plan 9 API:
UTF-8 (runetochar, chartorune, utf*, etc.), UTF-32 (runestr*), and
Unicode (is*rune, etc.). The trouble is, I don't think it's necessary
for a single library to do all of these things.

UTF-8 is just an encoding of 31-bit integers, and UTF-32 is another
encoding. The stuff specific to Unicode, which requires the latest
Unicode database and all that, is really a separate issue -- as is the
rejection of certain values, like surrogates or values over 0x10FFFF,
both of which are only invalid because of the braindead UTF-16
encoding. And grapheme handling is yet another thing, which has
nothing to do with UTF at all.

So in earlier versions of libutf I was vigilant in rejecting those
values that Unicode says are invalid, but in my latest version on
GitHub I've started rejecting only overlong sequences, since the
others are still (in my view) valid UTF-8 even if they aren't valid
Unicode. Is this the right thing to do? I've not yet made up my mind.
But my feeling is that the API for reading UTF-8 should be separate
from that which deals with Unicode codepoints and graphemes that
happen to have been encoded in UTF-8. The two are essentially
orthogonal, though they are often conflated.

Incidentally, I also changed my latest version to only ever need one
byte of lookahead. For one thing, the Plan 9 version will say that a
rune is not full even when it is, if it is malformed; that is fixed in
my implementation.
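To make the distinction concrete, here is a sketch of a decoder in that
spirit (hypothetical code, not the actual libutf implementation; the
name utf8decode and its signature are invented for illustration): it
rejects overlong sequences, which are malformed UTF-8 in any reading,
but lets surrogates and values above 0x10FFFF through, since those are
only invalid as *Unicode*, not as UTF-8.

```c
#include <stddef.h>

/* Smallest value encodable in n bytes; anything below minval[n]
 * encoded in n bytes is an overlong sequence. Goes up to the
 * original 6-byte, 31-bit range of UTF-8. */
static const unsigned long minval[] = {
	0, 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000
};

/* Decode one UTF-8 sequence from s (at most len bytes) into *r.
 * Returns the number of bytes consumed, or 0 if the input is
 * malformed. Overlong sequences are rejected; surrogates and values
 * over 0x10FFFF are not -- they are valid UTF-8, if not valid
 * Unicode. */
size_t
utf8decode(const char *s, size_t len, unsigned long *r)
{
	unsigned char c;
	size_t i, n;

	if (len == 0)
		return 0;
	c = s[0];
	if (c < 0x80) {			/* 0xxxxxxx: ASCII */
		*r = c;
		return 1;
	}
	/* Count the leading ones to get the sequence length (2..6). */
	for (n = 0; c & 0x80; n++)
		c <<= 1;
	if (n < 2 || n > 6 || n > len)
		return 0;		/* stray continuation or bad lead */
	*r = c >> n;			/* payload bits of the lead byte */
	for (i = 1; i < n; i++) {
		if (((unsigned char)s[i] & 0xC0) != 0x80)
			return 0;	/* not a continuation byte */
		*r = (*r << 6) | ((unsigned char)s[i] & 0x3F);
	}
	if (*r < minval[n])		/* fits in fewer bytes: overlong */
		return 0;
	return n;
}
```

So utf8decode("\xC0\x80", ...) fails (overlong NUL), while the
surrogate D800 encoded as ED A0 80 decodes fine; whether to accept the
latter is exactly the policy question above, and it belongs to the
Unicode layer, not the UTF-8 one.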
But another thing, which is only in my latest version, is that it
always reads the fewest bytes needed to determine that a sequence is
malformed. One benefit of this is that if you're reading with fgetc(),
you can then ungetc() the byte that showed the sequence was malformed
(say, it was too short), and POSIX only guarantees that you can
ungetc() a single byte. That may not be relevant for sbase, of course,
but I'm just saying there's a reason for the slight difference in
complexity between the version in sbase and the latest version on my
GitHub.

cls
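P.S. A sketch of the fgetc()/ungetc() reading loop described above, to
show why detecting malformation on the earliest possible byte matters
(hypothetical code with an invented name, utf8fgetr -- not the actual
libutf code):

```c
#include <stdio.h>

/* Read one UTF-8 sequence from fp into *r. Returns 1 on success, 0 on
 * malformed input, EOF at end of file. On a malformed sequence the
 * byte that proved it malformed is pushed back with ungetc(), so the
 * caller can resynchronise from it. POSIX only guarantees one byte of
 * pushback, so the decoder must never read further ahead than that. */
int
utf8fgetr(FILE *fp, unsigned long *r)
{
	int c, i, n;

	if ((c = fgetc(fp)) == EOF)
		return EOF;
	if (c < 0x80) {			/* ASCII */
		*r = c;
		return 1;
	}
	for (n = 0; c & 0x80; n++)	/* count leading ones */
		c = (c << 1) & 0xFF;
	if (n < 2 || n > 6)
		return 0;		/* bad lead byte: just consume it */
	*r = (unsigned long)c >> n;	/* payload bits of the lead byte */
	for (i = 1; i < n; i++) {
		if ((c = fgetc(fp)) == EOF)
			return 0;	/* truncated at end of file */
		if ((c & 0xC0) != 0x80) {
			/* Sequence too short: this one byte revealed it,
			 * and one byte is all POSIX lets us push back. */
			ungetc(c, fp);
			return 0;
		}
		*r = (*r << 6) | (c & 0x3F);
	}
	return 1;
}
```

A decoder that insisted on reading the full expected length before
judging the sequence would need to push back several bytes here, which
ungetc() doesn't promise to support.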
