On 26 June 2010 20:59, Michal Suchanek <hramr...@centrum.cz> wrote:
> Indeed, the loss is at the end in case of web pages, parts which are
> missing in the middle are result of inserting different streams so
> SCSU would not suffer more breakage than other encodings. Still there
> is no apparent benefit in using it.
For storing many short strings, whether compiled into one bundle or not,
SCSU is ideal.

>> And HTML is also a file format with the equivalent of shifts; it just
>> calls them tags.
>
> However, most HTML parsers are very well capable of parsing incomplete
> HTML because the tags don't change the meaning of text except when it
> is part of a tag attribute.

]]> begs to differ. But, again, we rarely experience this issue with the
omnipresent binary formats.

>>>> SCSU is of course a poor choice for an in-memory format (use UTF-16)
>>>> or for interacting with the console (for backwards compatibility
>>>> you're probably going to have to use UTF-8). But for a storage
>>>> format, particularly one embedded within a database? It's pretty
>>>> much perfect.
>>>
>>> Anybody who suggests to use UTF-16 for anything has no idea about
>>> useful encodings in my book. UTF-16 has no advantage whatsoever, only
>>> disadvantages.
>>
>> Would you care to enumerate your points then?
>
> UTF-8 is endianness independent and null-free, UTF-16 is not. In
> transport losing a byte (or a packet with an unknown, possibly odd
> number of bytes) may corrupt at most one character of UTF-8, but it may
> misalign the whole stream of UTF-16.

I said UTF-16 /in memory/, not for transport. Whole different kettle of
fish (the first sketch at the end of this mail shows the difference a
single lost byte makes).

> UTF-32 is dword aligned, you can index into it as an array and every
> position is a codepoint. UTF-16 has surrogate pairs so you have to
> decode the whole string to get at codepoints.

You rarely need to index into it at code-point intervals. For most
things pointers are sufficient (the second sketch at the end of this
mail shows the distinction in practice).

And you should note that "dword" is a rather vague term; I presume you
are referring to the x86's 32-bit double word (which is not even
consistent across x86 documentation - the i386 SysV ABI used by all
unixlikes takes a word to be 32 bits).

(I could also mention that every index in a UTF-16 string is also
technically a codepoint, but let's not get into a battle of semantics;
the correct term for what you are referring to is a scalar value.)

> I know no language for which UTF-16 is storage-efficient. For
> languages using Latin, UTF-8 or legacy encodings are about twice as
> efficient. For Cyrillic, legacy encodings are much more efficient; I
> don't know how UTF-16 compares to UTF-8 here. For CJK, UTF-16 is about
> 2/3 of UTF-8, but more efficient alternative encodings exist and are in
> widespread use.

Said more efficient alternative encodings are not Unicode and should not
be considered a serialization of it. An endemic problem with using them
as such is that some have mapped characters over the ASCII common set -
a prime example being that Shift-JIS replaced the backslash with a yen
sign.

Those legacy encodings also often require complex string-search logic
(Shift-JIS again being a prime example; see the last sketch at the end
of this mail).

For Chinese, the recommended backwards-compatible encoding is GB 18030.
This is a good effort but flawed (decoding it is an absolute nightmare),
and it should be converted to a more usable format (e.g. UTF-16) for
in-memory use.

> If you know any advantage of UTF-16 then please enlighten me.

UTF-16 is very efficient to work with. It's for this reason that many
languages which adopted Unicode after the expansion of the coding space
still picked it (Python for one). It is an effective tradeoff of space
and speed.
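A few quick sketches of the points above, in Python since it came up.
These are mine, not anything from the earlier mails, and the sample
strings are arbitrary. First, what a single lost byte does to a UTF-8
stream versus a UTF-16 stream:

    # Illustrative sketch (arbitrary sample text): drop one byte from a
    # UTF-8 stream and from a UTF-16 stream, then try to decode both.
    text = "naïve café résumé"

    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-le")

    # Simulate transport loss by removing the byte at offset 3.
    damaged_utf8 = utf8[:3] + utf8[4:]
    damaged_utf16 = utf16[:3] + utf16[4:]

    # UTF-8 is self-synchronising: only the character that owned the
    # lost byte is damaged, the rest decodes cleanly.
    print(damaged_utf8.decode("utf-8", errors="replace"))

    # UTF-16 code units are misaligned from the gap onwards, so
    # everything after it comes out as mojibake.
    print(damaged_utf16.decode("utf-16-le", errors="replace"))

The misalignment hits any fixed-width encoding once framing is lost,
which is exactly why it matters for transport and not for an in-memory
representation.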
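Second, code points versus UTF-16 code units, using a recent Python 3
(whose strings index by scalar value, much as a UTF-32 representation
would):

    # Illustrative sketch: a character outside the BMP.
    s = "G clef: \U0001D11E"       # U+1D11E MUSICAL SYMBOL G CLEF

    # Indexing by code point, as UTF-32 (or a Python 3 str) allows.
    print(len(s))                   # 9 code points
    print(hex(ord(s[-1])))          # 0x1d11e

    # In UTF-16 the same character needs a surrogate pair, i.e. two
    # 16-bit code units, so fixed-width indexing by code point breaks.
    units = s.encode("utf-16-le")
    print(len(units) // 2)          # 10 code units

That is the scalar-value distinction in one line: the clef is one scalar
value but two UTF-16 code units.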
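And finally the Shift-JIS search problem; 表 (U+8868) is just one
convenient example of a double-byte character whose trailing byte
collides with the ASCII backslash:

    # Illustrative sketch: naive byte-level search over Shift-JIS data.
    path = "C:\\表"                  # one real backslash, then U+8868
    sjis = path.encode("shift_jis")   # 表 encodes as 0x95 0x5C
    print(sjis.count(b"\\"))          # 2 matches at the byte level
    print(path.count("\\"))           # 1 actual backslash in the text

Searching for a path separator or a quote byte therefore needs
encoding-aware logic, which UTF-8 avoids by never reusing ASCII byte
values inside multi-byte sequences.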