[dev] Unicode---Give us all of it!

Stephan Bergmann Fri, 10 Nov 2006 08:12:24 -0800

Unicode---Give us all of it!
============================

Unicode encodes characters in a codespace that ranges from 0 to 0x10FFFF. Much of the OOo code base operates on UTF-16 code units that range from 0 to 0xFFFF:


- C/C++ code based on sal_Unicode.

- Java code based on Java char.

- UNO based on UNO CHAR.

It is obvious that a single UTF-16 code unit cannot represent all of Unicode. Thus, UTF-16 is designed in such a way that each Unicode character can be represented in UTF-16 as an ordered sequence of at most two code units: Characters in the ranges U+0000--D7FF and U+E000--FFFF are represented by a single UTF-16 code unit (of the respective numeric value). Characters in the range U+10000--10FFFF are represented by two UTF-16 code units, a high surrogate in the range 0xD800--DBFF followed by a low surrogate in the range 0xDC00--DFFF.
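To make the mapping concrete, here is a minimal C++ sketch of the encoding rule just described. The function name encodeUtf16 is made up for illustration and is not an existing OOo function; std::u16string merely stands in for a sequence of UTF-16 code units.

```cpp
#include <cassert>
#include <string>

// Illustrative helper (not OOo code): encode one Unicode character as
// one or two UTF-16 code units.
std::u16string encodeUtf16(char32_t cp) {
    // Precondition: cp is in 0--0x10FFFF and is not itself a surrogate.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x10000) {
        // U+0000--D7FF and U+E000--FFFF: one code unit of the same value.
        return std::u16string(1, static_cast<char16_t>(cp));
    }
    // U+10000--10FFFF: subtract 0x10000, leaving 20 bits that are split
    // into a high surrogate (upper 10 bits) and a low surrogate (lower 10).
    char32_t v = cp - 0x10000;
    char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));
    char16_t low = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    return std::u16string{high, low};
}
```

For example, U+0041 yields the single code unit 0x0041, while U+1D11E yields the pair 0xD834, 0xDD1E.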

In turn, it should be obvious that treating single UTF-16 code units as representing Unicode characters does not work. However, since most actually used Unicode characters are in the range U+0000--FFFF (and can hence faithfully be represented by a single UTF-16 code unit), this problem is not apparent in all situations. This will gradually change as Unicode characters in the range U+10000--10FFFF are used more and more frequently, especially in East Asian locales. And this should be motivation to enhance OOo so that all parts of it work flawlessly with all of Unicode.
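One way to see the problem is string reversal. The following C++ sketch is purely illustrative (both function names are made up, and std::u16string stands in for a UTF-16 string). It contrasts a reversal that treats every code unit as a character with one that keeps surrogate pairs intact. It assumes well-formed input; unpaired surrogates are left where the reversal puts them.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>

// Naive version: reverses code units, so the two halves of every
// surrogate pair end up in the wrong order (low before high).
std::u16string reverseCodeUnits(std::u16string s) {
    std::reverse(s.begin(), s.end());
    return s;
}

bool isHighSurrogate(char16_t c) { return c >= 0xD800 && c <= 0xDBFF; }
bool isLowSurrogate(char16_t c) { return c >= 0xDC00 && c <= 0xDFFF; }

// Pair-aware version: reverse the code units, then swap every
// low-high sequence back into high-low order.
std::u16string reverseCodePoints(std::u16string s) {
    std::reverse(s.begin(), s.end());
    for (std::size_t i = 0; i + 1 < s.size(); ++i) {
        if (isLowSurrogate(s[i]) && isHighSurrogate(s[i + 1])) {
            std::swap(s[i], s[i + 1]);
            ++i;  // skip the pair that was just repaired
        }
    }
    return s;
}
```

With text restricted to U+0000--FFFF both functions return the same result, which is why such code appears to work until a character from U+10000--10FFFF shows up.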

In Java 5, this problem has been addressed by augmenting functionality based on Java char single UTF-16 code units (e.g., String.charAt) with functionality based on Java int (0--0x10FFFF) Unicode encoded characters (e.g., String.codePointAt), and by using functionality based on java.lang.String UTF-16 code unit sequences. Similar solutions are needed for C/C++ code and UNO APIs.
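For comparison, a C++ counterpart to Java's String.codePointAt could look roughly like the sketch below. The function is hypothetical and does not exist in OOo; std::u16string again stands in for a UTF-16 string. Like the Java method, it returns an unpaired surrogate unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: return the Unicode character that starts at
// code-unit index i. Indexing is still by code unit, as in Java.
char32_t codePointAt(const std::u16string& s, std::size_t i) {
    assert(i < s.size());
    char16_t c = s[i];
    if (c >= 0xD800 && c <= 0xDBFF && i + 1 < s.size()) {
        char16_t d = s[i + 1];
        if (d >= 0xDC00 && d <= 0xDFFF) {
            // Combine high and low surrogate into one value.
            return 0x10000 + ((static_cast<char32_t>(c) - 0xD800) << 10)
                   + (d - 0xDC00);
        }
    }
    return c;  // single code unit, or an unpaired surrogate
}
```

Plain indexing (s[i], the analog of String.charAt) on a string that starts with U+1D11E returns 0xD834, whereas codePointAt(s, 0) returns 0x1D11E.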

A related problem is that Unicode combining character sequences like U+0041 LATIN CAPITAL LETTER A followed by U+20E3 COMBINING ENCLOSING KEYCAP shall be treated as single characters in certain applications. (For example, if you can specify the bullet symbol that shall precede each list item you enter in a word processor, combining character sequences could be useful choices for such a symbol.) This indicates that an application's concept of "character" is often best represented by a programming environment's concept of "string."



C/C++ Code
----------

The approach here has two parts:

- Use sal_uInt32 to represent individual Unicode encoded characters and add any necessary base functionality to rtl::OUString (e.g., operating on the individual Unicode encoded characters represented by an instance of rtl::OUString).

- Find all the places in the code that need to be adapted.
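As a rough idea of what the first part could mean in practice, here is a self-contained C++ sketch. The typedefs are local stand-ins so that it compiles without the OOo headers (the real sal_uInt32 and sal_Unicode come from sal/types.h), UString stands in for rtl::OUString, and nextCodePoint and countCodePoints are hypothetical names, not existing rtl::OUString functionality.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-ins for the OOo types (assumptions, for illustration only).
typedef std::uint32_t sal_uInt32;
typedef char16_t sal_Unicode;
typedef std::basic_string<sal_Unicode> UString;  // in place of rtl::OUString

// Hypothetical helper: return the Unicode character starting at index
// and advance index past it (by one or two code units).
// Precondition: index < s.size().
sal_uInt32 nextCodePoint(const UString& s, std::size_t& index) {
    sal_Unicode c = s[index++];
    if (c >= 0xD800 && c <= 0xDBFF && index < s.size()) {
        sal_Unicode d = s[index];
        if (d >= 0xDC00 && d <= 0xDFFF) {
            ++index;
            return 0x10000 + ((static_cast<sal_uInt32>(c) - 0xD800) << 10)
                   + (d - 0xDC00);
        }
    }
    return c;  // single code unit, or an unpaired surrogate
}

// Number of Unicode characters, as opposed to s.size() code units.
std::size_t countCodePoints(const UString& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size();) {
        nextCodePoint(s, i);
        ++n;
    }
    return n;
}
```

Code that loops over a string with a plain index and treats each sal_Unicode as a character is the kind of place the second part is looking for; a loop built on a helper like nextCodePoint would see each character exactly once.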


Java Code
---------

No Java code within OOo that would need to be adapted has been identified. (Any necessary adaptation would be complicated by the fact that OOo shall be compatible with Java 1.3.1, so that much of the functionality offered by Java 5 would not be available.)



UNO APIs
--------

Replace (if unpublished) or supersede (if published) any API that uses CHAR with a corresponding API that uses STRING. Find attached a list of all occurrences of CHAR within the API (types.rdb) of SRC680m193.
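The effect of such a change can be sketched in C++. The example is loosely modelled on a fill character such as TabStop's FillChar from the attached list, but both functions and their names are hypothetical and are not part of any UNO API.

```cpp
#include <cstddef>
#include <string>

// CHAR-style interface: the fill "character" is one UTF-16 code unit,
// so it can be neither a character from U+10000--10FFFF nor a
// combining character sequence.
std::u16string fillWithChar(char16_t fillChar, std::size_t count) {
    return std::u16string(count, fillChar);
}

// STRING-style interface: the fill "character" is a string, so any
// Unicode character or combining character sequence can be used.
std::u16string fillWithString(const std::u16string& fill, std::size_t count) {
    std::u16string result;
    for (std::size_t i = 0; i < count; ++i) {
        result += fill;
    }
    return result;
}
```

With the STRING-style variant, a caller can pass u"\U0001D11E" (two code units) or u"A\u20E3" (a combining character sequence) as the fill, neither of which fits into a single CHAR.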



How to proceed
--------------

In a first step, I will try to identify and gather as many places in OOo that need to be adapted as possible, but I need your help for that: IF YOU KNOW OF ANY PLACE IN OOo THAT NEEDS TO BE ADAPTED, PLEASE LET ME KNOW.

Once all places have been identified, we can see how to address the task of adapting them accordingly.



-Stephan

com/sun/star/accessibility/XAccessibleText: char getCharacter([in] long nIndex)
com/sun/star/awt/KeyEvent: char KeyChar
com/sun/star/awt/KeyStroke: char KeyChar
com/sun/star/awt/SimpleFontMetric: char FirstChar
com/sun/star/awt/SimpleFontMetric: char LastChar
com/sun/star/awt/XFont: sequence<short> getCharWidths([in] char nFirst, [in] char nLast)
com/sun/star/awt/XFont: short getCharWidth([in] char c)
com/sun/star/awt/XFont: void getKernPairs([out] sequence<char> Chars1, [out] sequence<char> Chars2, [out] sequence<short> Kerns)
com/sun/star/awt/XTextEditField: void setEchoChar([in] char cEcho)
com/sun/star/i18n/XExtendedInputSequenceChecker: long correctInputSequence([inout] string aText, [in] long nPos, [in] char cInputChar, [in] short nInputCheckMode)
com/sun/star/i18n/XExtendedTransliteration: char transliterateChar2Char([in] char cChar)
com/sun/star/i18n/XExtendedTransliteration: string transliterateChar2String([in] char cChar)
com/sun/star/i18n/XInputSequenceChecker: boolean checkInputSequence([in] string aText, [in] long nPos, [in] char cInputChar, [in] short nInputCheckMode)
com/sun/star/io/XDataInputStream: char readChar()
com/sun/star/io/XDataOutputStream: void writeChar([in] char Value)
com/sun/star/io/XTextInputStream: string readString([in] sequence<char> Delimiters, [in] boolean bRemoveDelimiter)
com/sun/star/style/TabStop: char DecimalChar
com/sun/star/style/TabStop: char FillChar
com/sun/star/test/bridge/TestSimple: char Char
com/sun/star/test/bridge/XBridgeTest2: sequence<char> setSequenceChar([in] sequence<char> aSeq)
com/sun/star/test/bridge/XBridgeTest2: void setSequencesInOut([inout] sequence<boolean> aSeqBoolean, [inout] sequence<char> aSeqChar, ...)
com/sun/star/test/bridge/XBridgeTest2: void setSequencesOut([out] sequence<boolean> aSeqBoolean, [out] sequence<char> aSeqChar, ...)
com/sun/star/test/bridge/XBridgeTestBase: [attribute] char Char
com/sun/star/test/bridge/XBridgeTestBase: com/sun/star/test/bridge/TestData getValues([out] boolean bBool, [out] char cChar, ...)
com/sun/star/test/bridge/XBridgeTestBase: com/sun/star/test/bridge/TestData setValues2([inout] boolean bBool, [inout] char cChar, ...)
com/sun/star/test/bridge/XBridgeTestBase: void setValues([in] boolean bBool, [in] char cChar, ...)
com/sun/star/test/performance/SimpleTypes: char Char
com/sun/star/text/TextSortDescriptor2: [property] char Delimiter
com/sun/star/text/TextSortDescriptor: [property] char Delimiter
