On 26 June 2010 23:47, Owen Shepherd <owen.sheph...@e43.eu> wrote:
> On 26 June 2010 20:59, Michal Suchanek <hramr...@centrum.cz> wrote:
>>

>>>>>
>>>>> SCSU is of course a poor choice for an in-memory format (Use UTF-16)
>>>>> or interacting with the console (For backwards compatibility you're
>>>>> probably going to have to use UTF-8). But for a storage format,
>>>>> particularly one embedded within a database? It's pretty much perfect.
>>>>
>>>> Anybody who suggests using UTF-16 for anything has no idea about
>>>> useful encodings, in my book. UTF-16 has no advantage whatsoever, only
>>>> disadvantages.
>>>
>>> Would you care to enumerate your points then?
>>>
>>
>> UTF-8 is endianness-independent and null-free; UTF-16 is neither. In
>> transport, losing a byte (or a packet with an unknown, possibly odd
>> number of bytes) corrupts at most one character of UTF-8, but it may
>> misalign the whole stream of UTF-16.
>
> I said UTF-16 /in memory/. Not for transport. Whole different kettle
> of fish.
>
>> UTF-32 is dword-aligned; you can index into it as an array, and every
>> position is a codepoint. UTF-16 has surrogate pairs, so you have to
>> decode the whole string to get at codepoints.
>
> You rarely need to index into it at codepoint intervals. For most
> things, pointers are sufficient.
>
> And you should note that "dword" is a rather vague term; I somehow
> presume you are referring to the x86's 32-bit double word (which is not
> even consistent in x86 documentation - the i386 SysV ABI used by all
> unixlikes takes a word to be 32 bits).
>
> (I could also mention that every index in a UTF-16 string is also
> technically a codepoint, but let's not get into a battle of semantics;
> the correct term for what you are referring to is a scalar value).

Well, a 16-bit word isn't a codepoint any more than a byte is in
UTF-8. Some braindead runtimes assumed that a (16-bit) word of UTF-16
actually is a codepoint and had to be fixed later, and there are still
issues with legacy code that expects the old behaviour.
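
For illustration, here is roughly what iterating UTF-16 by codepoint
takes once you stop pretending that every 16-bit word is one (a rough
C sketch, assuming well-formed input, error handling omitted):

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: stepping through UTF-16 by codepoint.  Words in
     * 0xD800-0xDBFF are high surrogates and must be combined with
     * the following low surrogate; a runtime that treats every
     * 16-bit word as a codepoint gets non-BMP text wrong.
     * Assumes well-formed input. */
    static uint32_t utf16_next(const uint16_t **p)
    {
        uint16_t hi = *(*p)++;
        if (hi >= 0xD800 && hi <= 0xDBFF) {
            uint16_t lo = *(*p)++;
            return 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                              | (uint32_t)(lo - 0xDC00));
        }
        return hi;
    }

    int main(void)
    {
        /* U+1D11E MUSICAL SYMBOL G CLEF is the pair D834 DD1E */
        const uint16_t s[] = { 0xD834, 0xDD1E, 0x0041, 0x0000 };
        const uint16_t *p = s;
        while (*p)
            printf("U+%04X\n", (unsigned)utf16_next(&p));
        return 0;
    }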

It may be that on many CPUs it is more time-efficient when the branch
that reads more than one word from the string to get a codepoint is
rarely taken. But then the ultimate efficiency is achieved with
UTF-32: it is 32-bit aligned, which is by far the fastest on most
CPUs, and it needs no branching at all at this level. Also, for short
strings the reduced code complexity outweighs any savings from string
compression with a more space-efficient encoding.
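
To make that concrete, a sketch of the two lookups (again assuming
well-formed input; utf16_at and utf32_at are just illustrative names):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of the indexing difference.  UTF-32: the n-th codepoint
     * is a plain array access.  UTF-16: every word before it must be
     * inspected for surrogates first. */
    static uint32_t utf32_at(const uint32_t *s, size_t n)
    {
        return s[n];                  /* O(1), no branches */
    }

    static uint32_t utf16_at(const uint16_t *s, size_t n)
    {
        while (n--) {                 /* O(n) scan to find the start */
            if (*s >= 0xD800 && *s <= 0xDBFF)
                s++;                  /* skip the low surrogate too */
            s++;
        }
        if (*s >= 0xD800 && *s <= 0xDBFF)
            return 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10)
                              | (uint32_t)(s[1] - 0xDC00));
        return *s;
    }

    int main(void)
    {
        const uint16_t u16[] = { 0xD834, 0xDD1E, 0x0041 };  /* G clef, 'A' */
        const uint32_t u32[] = { 0x1D11E, 0x0041 };
        printf("utf16_at(1) = U+%04X\n", (unsigned)utf16_at(u16, 1));
        printf("utf32_at(1) = U+%04X\n", (unsigned)utf32_at(u32, 1));
        return 0;
    }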

>
>> I know no language for which UTF-16 is storage-efficient. For
>> languages using Latin, UTF-8 or legacy encodings are about twice as
>> efficient. For Cyrillic, legacy encodings are much more efficient; I
>> don't know how UTF-16 compares to UTF-8 here. For CJK, UTF-16 is about
>> 2/3 of UTF-8, but more efficient alternative encodings exist and are in
>> widespread use.
>
> Said more efficient alternative encodings are not Unicode and should
> not be considered a serialization of such. An endemic problem with
> using them as such is that some have mapped characters over the ASCII
> common set - a prime example being that Shift-JIS replaced the
> backslash with a yen sign. Those legacy encodings also often require
> complex string search logic (Shift-JIS again being a prime example).
>
> For Chinese, the recommended backwards-compatible encoding is GB
> 18030. This is a good effort but flawed (Decoding it is an absolute
> nightmare), and should be converted to a more usable (e.g. UTF-16)
> format for in memory use.

These are not Unicode encodings; they were developed for Japanese and
Chinese, respectively.

And I don't see why you promote SCSU, which maps just about anything
over ASCII and has shifts, yet bash Shift-JIS for mapping the yen sign
over backslash and having shifts. There is an issue with legacy
software using yen instead of backslash, but that is not a problem
with the encoding itself; it's a problem with how it is abused. It is
still a valid reason to avoid the encoding, though, since it makes the
semantics of these incorrectly used codes ambiguous.
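
The search problem with Shift-JIS is easy to demonstrate: the trail
byte of a double-byte character can itself be 0x5C, so a naive byte
search for backslash fires inside a character. A rough sketch (the
pair 0x95 0x5C is the Shift-JIS encoding of U+8868):

    #include <stdio.h>
    #include <string.h>

    /* Sketch: byte-wise search breaks on Shift-JIS because a trail
     * byte may equal 0x5C ('\').  0x95 0x5C below is U+8868, a
     * famous offender. */
    static int sjis_is_lead(unsigned char c)
    {
        return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
    }

    int main(void)
    {
        const char s[] = "\x95\x5C\x5C";   /* one kanji, one '\' */

        /* naive search: hits offset 1, inside the kanji */
        printf("naive: %d\n", (int)(strchr(s, 0x5C) - s));

        /* aware search: walk from the start, skipping trail bytes */
        for (const char *p = s; *p; p++) {
            if (sjis_is_lead((unsigned char)*p)) { p++; continue; }
            if (*p == 0x5C) {
                printf("aware: %d\n", (int)(p - s));
                break;
            }
        }
        return 0;
    }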

Now, SCSU might be easier to decode, which makes the required code
smaller in comparison to other encodings, but SCSU is not widely
supported and would require constant recoding to and from the system
encoding and the web encoding, which inflates the required code again.
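
For what it's worth, the easy part of SCSU really is small; a sketch
of just the initial single-byte state per UTS #6, with the tag bytes
and window switching that a complete decoder needs left out:

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Sketch of SCSU's initial single-byte state: printable ASCII
     * passes through, and 0x80-0xFF maps into dynamic window 0 at
     * U+0080, so Latin-1 text is already its own SCSU encoding.
     * Tag bytes (SQn/SCn/SDn/SCU) and Unicode mode are not handled. */
    static void scsu_default_decode(const uint8_t *in, size_t n)
    {
        const uint32_t window = 0x0080;    /* default window 0 base */
        for (size_t i = 0; i < n; i++) {
            uint8_t b = in[i];
            if (b >= 0x20 && b < 0x80)     /* literal ASCII */
                printf("U+%04X\n", (unsigned)b);
            else if (b >= 0x80)            /* offset into the window */
                printf("U+%04X\n", (unsigned)(window + (b - 0x80)));
            /* bytes below 0x20 are controls or tags; not handled */
        }
    }

    int main(void)
    {
        /* "caf\xE9" is both Latin-1 and SCSU for the same text */
        const uint8_t scsu[] = { 'c', 'a', 'f', 0xE9 };
        scsu_default_decode(scsu, sizeof scsu);
        return 0;
    }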

>
>> If you know of any advantage of UTF-16, then please enlighten me.
>
> UTF-16 is very efficient to work with. It's for this reason that many
> languages which adopted Unicode after the expansion of the coding
> space still picked it (Python for one). It is an effective tradeoff
> of space and speed.

I don't see the efficiency. It's a middle-of-the-road approach which
does not really work well for any case, and people just hope it
doesn't turn out too bad. Now, with fast processors and cheap RAM,
optimization is a low priority, and pretty much anything goes as long
as it works correctly. The lack of need for efficiency does not make
UTF-16 efficient, though.
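
For the record, the storage figures mentioned earlier follow straight
from the encoding rules; a sketch:

    #include <stdint.h>
    #include <stdio.h>

    /* Sketch: bytes per codepoint in UTF-8 (RFC 3629 ranges) versus
     * UTF-16 (surrogate pairs above the BMP). */
    static int utf8_len(uint32_t cp)
    {
        return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
    }

    static int utf16_len(uint32_t cp)
    {
        return cp < 0x10000 ? 2 : 4;
    }

    int main(void)
    {
        const struct { const char *what; uint32_t cp; } t[] = {
            { "Latin    U+0061", 0x0061 },   /* UTF-8 wins, 1 vs 2  */
            { "Cyrillic U+0430", 0x0430 },   /* tie, 2 vs 2         */
            { "CJK      U+4E2D", 0x4E2D },   /* UTF-16 wins, 2 vs 3 */
            { "non-BMP  U+1D11E", 0x1D11E }, /* tie, 4 vs 4         */
        };
        for (int i = 0; i < 4; i++)
            printf("%s  utf8=%d  utf16=%d\n", t[i].what,
                   utf8_len(t[i].cp), utf16_len(t[i].cp));
        return 0;
    }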

But if you strive for efficiency, then I don't see any need for an
"internal encoding" in fossil at all. Fossil processes strings only
once: when you commit a changeset, it cuts an excerpt of the commit
message and stores it for use in the timeline and such.

Thanks

Michal