On 26 June 2010 20:59, Michal Suchanek <hramr...@centrum.cz> wrote:
> Indeed, the loss is at the end in case of web pages, parts which are
> missing in the middle are result of inserting different streams so
> SCSU would not suffer more breakage than other encodings. Still there
> is no apparent benefit in using it.
For storing many short strings, whether compiled into one bundle or not,
SCSU is ideal.

>> And HTML is also a file format with the equivalent of shifts; it just
>> calls them tags.
>
> However, most HTML parsers are very well capable of parsing incomplete
> HTML because the tags don't change the meaning of text except when it
> is part of a tag attribute.

]]> begs to differ. But, again, we rarely experience this issue with the
omnipresent binary formats.

>>>> SCSU is of course a poor choice for an in-memory format (use UTF-16)
>>>> or for interacting with the console (for backwards compatibility
>>>> you're probably going to have to use UTF-8). But for a storage
>>>> format, particularly one embedded within a database? It's pretty
>>>> much perfect.
>>>
>>> Anybody who suggests to use UTF-16 for anything has no idea about
>>> useful encodings in my book. UTF-16 has no advantage whatsoever, only
>>> disadvantages.
>>
>> Would you care to enumerate your points then?
>
> UTF-8 is endianness independent and null-free, UTF-16 is not. In
> transport losing a byte (or a packet with an unknown, possibly odd
> number of bytes) may corrupt at most one character of UTF-8, but it may
> misalign the whole stream of UTF-16.

I said UTF-16 /in memory/, not for transport. Whole different kettle of
fish (the first sketch at the end of this mail shows the difference a
single lost byte makes).

> UTF-32 is dword aligned, you can index into it as an array and every
> position is a codepoint. UTF-16 has surrogate pairs so you have to
> decode the whole string to get at codepoints.

You rarely need to index into it at code-point intervals. For most
things pointers are sufficient (the second sketch at the end of this
mail shows the distinction in practice).

And you should note that "dword" is a rather vague term; I presume you
are referring to the x86's 32-bit double word (which is not even
consistent across x86 documentation - the i386 SysV ABI used by all
unixlikes takes a word to be 32 bits).

(I could also mention that every index in a UTF-16 string is also
technically a codepoint, but let's not get into a battle of semantics;
the correct term for what you are referring to is a scalar value.)

> I know no language for which UTF-16 is storage-efficient. For
> languages using Latin, UTF-8 or legacy encodings are about twice as
> efficient. For Cyrillic, legacy encodings are much more efficient; I
> don't know how UTF-16 compares to UTF-8 here. For CJK, UTF-16 is about
> 2/3 of UTF-8, but more efficient alternative encodings exist and are in
> widespread use.

Said more efficient alternative encodings are not Unicode and should not
be considered a serialization of it. An endemic problem with using them
as such is that some have mapped characters over the ASCII common set -
a prime example being that Shift-JIS replaced the backslash with a yen
sign.

Those legacy encodings also often require complex string-search logic
(Shift-JIS again being a prime example; see the last sketch at the end
of this mail).

For Chinese, the recommended backwards-compatible encoding is GB 18030.
This is a good effort but flawed (decoding it is an absolute nightmare),
and it should be converted to a more usable format (e.g. UTF-16) for
in-memory use.

> If you know any advantage of UTF-16 then please enlighten me.

UTF-16 is very efficient to work with. It's for this reason that many
languages which adopted Unicode after the expansion of the coding space
still picked it (Python for one). It is an effective tradeoff of space
and speed.
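A few quick sketches of the points above, in Python since it came up.
These are mine, not anything from the earlier mails, and the sample
strings are arbitrary. First, what a single lost byte does to a UTF-8
stream versus a UTF-16 stream:

    # Illustrative sketch (arbitrary sample text): drop one byte from a
    # UTF-8 stream and from a UTF-16 stream, then try to decode both.
    text = "naïve café résumé"

    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")

    # Simulate transport loss by removing the byte at offset 3.
    damaged_utf8 = utf8[:3] + utf8[4:]
    damaged_utf16 = utf16[:3] + utf16[4:]

    # UTF-8 is self-synchronising: only the character that owned the
    # lost byte is damaged, the rest decodes cleanly.
    print(damaged_utf8.decode("utf-8", errors="replace"))

    # UTF-16 code units are misaligned from the gap onwards, so
    # everything after it comes out as mojibake.
    print(damaged_utf16.decode("utf-16-le", errors="replace"))

The misalignment hits any fixed-width encoding once framing is lost,
which is exactly why it matters for transport and not for an in-memory
representation.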
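Second, code points versus UTF-16 code units, using a recent Python 3
(whose strings index by scalar value, much as a UTF-32 representation
would):

    # Illustrative sketch: a character outside the BMP.
    s = "G clef: \U0001D11E"       # U+1D11E MUSICAL SYMBOL G CLEF

    # Indexing by code point, as UTF-32 (or a Python 3 str) allows.
    print(len(s))                   # 9 code points
    print(hex(ord(s[-1])))          # 0x1d11e

    # In UTF-16 the same character needs a surrogate pair, i.e. two
    # 16-bit code units, so fixed-width indexing by code point breaks.
    units = s.encode("utf-16-le")
    print(len(units) // 2)          # 10 code units

That is the scalar-value distinction in one line: the clef is one scalar
value but two UTF-16 code units.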
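And finally the Shift-JIS search problem; 表 (U+8868) is just one
convenient example of a double-byte character whose trailing byte
collides with the ASCII backslash:

    # Illustrative sketch: naive byte-level search over Shift-JIS data.
    path = "C:\\表"                  # one real backslash, then U+8868
    sjis = path.encode("shift_jis")   # 表 encodes as 0x95 0x5C
    print(sjis.count(b"\\"))          # 2 matches at the byte level
    print(path.count("\\"))           # 1 actual backslash in the text

Searching for a path separator or a quote byte therefore needs
encoding-aware logic, which UTF-8 avoids by never reusing ASCII byte
values inside multi-byte sequences.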