Herbert Duerr wrote:
To support characters outside of the unicode base plane I'd like to add a
new sal_UCS4 type to OpenOffice.

The currently used sal_Unicode type is not sufficient for these characters.
Also the use of sal_Unicode is ambiguous in OOo. In interfaces with a scalar
sal_Unicode it means a character encoded in UCS-2, in interfaces with an
array of sal_Unicodes it means UTF-16 encoded characters.
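To make the scalar/array distinction concrete, here is a minimal sketch of reading code points out of an array of 16-bit units. The names Unit16 and nextCodePoint are hypothetical and used only for illustration (Unit16 stands in for sal_Unicode); passing unpaired surrogates through unchanged is an assumption of this sketch, not existing OOo behavior.

```cpp
#include <cstddef>
#include <cstdint>

typedef std::uint16_t Unit16; // hypothetical stand-in for sal_Unicode

// Reads the code point that starts at units[pos] and advances pos past it.
// Requires pos < len. An unpaired surrogate is returned unchanged, which is
// a policy choice made for this sketch only.
std::uint32_t nextCodePoint(const Unit16* units, std::size_t len, std::size_t& pos) {
    std::uint32_t hi = units[pos++];
    if (hi >= 0xD800 && hi <= 0xDBFF && pos < len) {
        std::uint32_t lo = units[pos];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            ++pos; // consume the low surrogate as well
            return 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        }
    }
    return hi;
}
```

A single 16-bit parameter can only ever carry one of the two units of such a pair, which is why a scalar interface cannot express characters outside the base plane.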

Interfaces that currently only take a sal_Unicode in the meaning UCS-2 are
broken by design regarding unicode surrogates. At least for the internal
interfaces the easiest fix is to change their signature to use UCS-4
instead of scalar sal_Unicodes. For the external interfaces with the design
bug mentioned above, new methods that are capable of handling characters
outside the base plane should be added.

See the thread at <http://www.openoffice.org/servlets/ReadMsg?listName=dev&msgNo=18462> for ideas on how to change interfaces (sometimes it is better to replace sal_Unicode with rtl::OUString etc.).

UCS-2 and UCS-4 are not Unicode (<www.unicode.org>) terms. Instead, Unicode is built on the concepts of _Unicode codespace_ (D4a), the range of integers from 0x0 to 0x10FFFF; _Unicode scalar value_ (D28), the codespace minus the surrogates; and _code unit_ (D28a), the 8/16/32 bit units on which the UTF-8/16/32 encoding forms are based.
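As a rough illustration of these three concepts, the following sketch uses plain fixed-width integers. CodePoint and Utf16Unit are hypothetical stand-ins chosen for this example (they are not the sal/types.h definitions), and the helper names are likewise made up.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef std::uint32_t CodePoint; // hypothetical: holds a codespace value
typedef std::uint16_t Utf16Unit; // hypothetical: one UTF-16 code unit

// Codespace: the integers 0x0 to 0x10FFFF.
bool isInCodespace(CodePoint c) { return c <= 0x10FFFF; }

// Surrogates: the part of the codespace reserved for UTF-16 pairs.
bool isSurrogate(CodePoint c) { return c >= 0xD800 && c <= 0xDFFF; }

// Scalar value: the codespace minus the surrogates.
bool isScalarValue(CodePoint c) { return isInCodespace(c) && !isSurrogate(c); }

// Encodes one scalar value as one or two UTF-16 code units.
std::vector<Utf16Unit> toUtf16(CodePoint c) {
    assert(isScalarValue(c));
    std::vector<Utf16Unit> out;
    if (c < 0x10000) {
        out.push_back(static_cast<Utf16Unit>(c));
    } else {
        c -= 0x10000; // 20 bits remain, split 10/10 over the pair
        out.push_back(static_cast<Utf16Unit>(0xD800 + (c >> 10)));
        out.push_back(static_cast<Utf16Unit>(0xDC00 + (c & 0x3FF)));
    }
    return out;
}
```

For example, U+1D11E is a single scalar value but needs two code units in UTF-16 (0xD834, 0xDD1E), whereas U+0041 needs one.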

sal_Unicode represents a UTF-16 code unit (without any ambiguity).

Of course the interfaces could be changed to something like sal_uInt32, but
then a lot of interesting type information would be lost.

I am somewhat indifferent here. We often use plain integer types to represent numerical quantities (be it lengths in mm, pixel sizes, Unicode scalar values), for better or worse. And at least I consistently used sal_uInt32 everywhere in the OOo code base where I needed to represent Unicode scalar values or UTF-32 code units.

Though the first step of adding a sal_UCS4 type to sal/types.h seemed to be
uncontroversial there was significant opposition to this idea. So I'd like
to collect the arguments against it:
- sal_uInt32 as an alternative is good enough

In my (pragmatic) eyes: yes.

- a typedef to sal_uInt32 is not good enough

Typedefs in C++ are, well, strange beasts. As a client you often have to be aware of exactly what other type the typedef aliases (e.g., when declaring overloaded functions, when using varargs, printf, when determining whether there is an appropriate streaming operator <<, when building expressions on integer types).
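A small sketch of what that means in practice, assuming a C++11 or later compiler. The names sketch_uInt32 and sketch_UCS4 are hypothetical stand-ins for sal_uInt32 and the proposed type, and describe and streamed are made up for this example.

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <type_traits>

typedef std::uint32_t sketch_uInt32; // hypothetical stand-in for sal_uInt32
typedef sketch_uInt32 sketch_UCS4;   // hypothetical stand-in for the proposed type

// A typedef introduces a second name, not a second type.
static_assert(std::is_same<sketch_UCS4, sketch_uInt32>::value,
              "typedef is only an alias");

std::string describe(sketch_uInt32) { return "integer"; }
// Adding the overload below would not compile, because it redefines the
// function above instead of overloading it:
// std::string describe(sketch_UCS4) { return "character"; }

// Streaming selects the operator<< of the underlying integer type, so the
// value is written as a number rather than as a character.
std::string streamed(sketch_UCS4 c) {
    std::ostringstream s;
    s << c;
    return s.str();
}
```

So describe(sketch_UCS4(0x41)) yields "integer" and streamed(0x41) yields "65": the client sees the aliased integer type in both cases, whatever the typedef is called.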

- unicode values beyond 2^32 are not unthinkable

How do you come to think that?  ;)

Did I miss any important issues against adding a sal_UCS4 type?

Would that be "sal_UCS4" or "sal_Ucs4"?

--
Herbert

-Stephan
