Unicode---Give us all of it!
============================
Unicode encodes characters in a codespace that ranges from 0 to
0x10FFFF. Much of the OOo code base operates on UTF-16 code units that
range from 0 to 0xFFFF:
- C/C++ code based on sal_Unicode.
- Java code based on Java char.
- UNO based on UNO CHAR.
A single UTF-16 code unit obviously cannot represent all of Unicode.
UTF-16 is therefore designed so that each Unicode character is
represented as an ordered sequence of at most two code units.
Characters in the ranges U+0000--D7FF and U+E000--FFFF are represented
by a single UTF-16 code unit (of the respective numeric value).
Characters in the range U+10000--10FFFF are represented by two UTF-16
code units, a high surrogate in the range 0xD800--DBFF followed by a low
surrogate in the range 0xDC00--DFFF.
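To make the mapping concrete, here is a minimal Java sketch of the
encoding described above. The class and method names are made up for
illustration and are not part of OOo or of any existing API.

```java
// Illustrative sketch only: Utf16EncodeSketch and encode are hypothetical names.
final class Utf16EncodeSketch {
    /** Encodes one Unicode character (0--0x10FFFF, no surrogates) as UTF-16. */
    static char[] encode(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF
                || (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
            throw new IllegalArgumentException("not a Unicode character");
        }
        if (codePoint <= 0xFFFF) {
            // U+0000--D7FF and U+E000--FFFF: one code unit of the same value.
            return new char[] { (char) codePoint };
        }
        // U+10000--10FFFF: split the 20-bit offset into two 10-bit halves.
        int offset = codePoint - 0x10000;
        char high = (char) (0xD800 + (offset >> 10));   // 0xD800--DBFF
        char low = (char) (0xDC00 + (offset & 0x3FF));  // 0xDC00--DFFF
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] units = encode(0x10400);
        // Prints "d801 dc00".
        System.out.println(Integer.toHexString(units[0]) + " "
                + Integer.toHexString(units[1]));
    }
}
```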
In turn, it should be obvious that treating single UTF-16 code units as
representing Unicode characters does not work. However, since most
Unicode characters in actual use are in the range U+0000--FFFF (and can
hence faithfully be represented by a single UTF-16 code unit), this
problem is not apparent in all situations. This will gradually change
as Unicode characters in the range U+10000--10FFFF are used more and
more frequently, especially in East Asian locales. That should be
motivation enough to enhance OOo so that all parts of it work flawlessly
with all of Unicode.
In Java 5, this problem has been addressed by augmenting functionality
that is based on single UTF-16 code units represented as Java char
(e.g., String.charAt) with functionality that is based on Unicode
encoded characters represented as Java int, 0--0x10FFFF (e.g.,
String.codePointAt), and by using functionality based on UTF-16 code
unit sequences represented as java.lang.String. Similar solutions are
needed for C/C++ code and UNO APIs.
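The difference between the two kinds of Java functionality can be seen
in a small sketch. It uses only the standard Java 5 methods
String.codePointAt and Character.charCount. The class name and the
helper countCodePoints are made up for this example.

```java
// Illustrative sketch only: CodePointSketch and countCodePoints are hypothetical names.
final class CodePointSketch {
    /** Counts Unicode characters rather than UTF-16 code units. */
    static int countCodePoints(String s) {
        int count = 0;
        // Advance by one or two code units, depending on the character.
        for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // "a", U+10400 (encoded as a surrogate pair), "b"
        String s = "a\uD801\uDC00b";
        System.out.println(s.length());                            // 4 code units
        System.out.println(Integer.toHexString(s.charAt(1)));      // d801, half a character
        System.out.println(Integer.toHexString(s.codePointAt(1))); // 10400, the whole character
        System.out.println(countCodePoints(s));                    // 3 characters
    }
}
```

Code that indexes with charAt sees the lone high surrogate at index 1,
which is exactly the kind of place that would need to be adapted.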
A related problem is that Unicode combining character sequences like
U+0041 LATIN CAPITAL LETTER A followed by U+20E3 COMBINING ENCLOSING
KEYCAP should be treated as single characters in certain applications.
(For example, if you can specify the bullet symbol that precedes each
list item you enter in a word processor, combining character sequences
could be useful choices for such a symbol.) This indicates that an
application's concept of "character" is often best represented by a
programming environment's concept of "string."
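As a rough illustration of the bullet example, the following Java sketch
holds the keycap sequence in a String. The class, the constant, and the
formatItem helper are hypothetical and do not correspond to any OOo
interface.

```java
// Illustrative sketch only: all names in this class are hypothetical.
final class KeycapBulletSketch {
    // U+0041 LATIN CAPITAL LETTER A followed by U+20E3 COMBINING ENCLOSING KEYCAP
    static final String KEYCAP_A = "A\u20E3";

    /** A bullet typed as String can carry a whole combining character sequence. */
    static String formatItem(String bullet, String text) {
        return bullet + " " + text;
    }

    public static void main(String[] args) {
        // Two code units and two Unicode characters, yet one symbol to the user.
        System.out.println(KEYCAP_A.length());                              // 2
        System.out.println(KEYCAP_A.codePointCount(0, KEYCAP_A.length()));  // 2
        System.out.println(formatItem(KEYCAP_A, "first item"));
    }
}
```

Neither a char nor an int parameter could hold this bullet, since the
sequence consists of two Unicode characters.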
C/C++ Code
----------
The approach here has two parts:
- Use sal_uInt32 to represent individual Unicode encoded characters and
  add any necessary base functionality to rtl::OUString (e.g., operating
  on the individual Unicode encoded characters represented by an instance
  of rtl::OUString).
- Find all the places in the code that need to be adapted.
Java Code
---------
No Java code within OOo that would need to be adapted has been
identified. (Any necessary adaptation would be complicated by the fact
that OOo must remain compatible with Java 1.3.1, so that much of the
functionality offered by Java 5 would not be available.)
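Should such code turn up, one conceivable workaround is a small
hand-written helper that avoids the Java 5 methods altogether. The
sketch below is only an assumption about what such a helper might look
like. None of these names exist in OOo. It uses nothing beyond
String.charAt and String.length.

```java
// Illustrative sketch only: a hypothetical stand-in for String.codePointAt
// that relies on nothing newer than String.charAt and String.length.
final class LegacyCodePointSketch {
    static boolean isHighSurrogate(char c) {
        return c >= 0xD800 && c <= 0xDBFF;
    }

    static boolean isLowSurrogate(char c) {
        return c >= 0xDC00 && c <= 0xDFFF;
    }

    /** Returns the Unicode character that starts at the given index. */
    static int codePointAt(String s, int index) {
        char c = s.charAt(index);
        if (isHighSurrogate(c) && index + 1 < s.length()) {
            char d = s.charAt(index + 1);
            if (isLowSurrogate(d)) {
                // Recombine the two 10-bit halves of the surrogate pair.
                return 0x10000 + ((c - 0xD800) << 10) + (d - 0xDC00);
            }
        }
        // Single code unit, or an unpaired surrogate returned unchanged.
        return c;
    }

    public static void main(String[] args) {
        // Prints "10400".
        System.out.println(Integer.toHexString(codePointAt("\uD801\uDC00", 0)));
    }
}
```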
UNO APIs
--------
Replace (if unpublished) or supersede (if published) any API that uses
CHAR with a corresponding API that uses STRING. Attached is a list of
all occurrences of CHAR within the API (types.rdb) of SRC680m193.
How to proceed
--------------
As a first step, I will try to identify and gather as many places in OOo
that need to be adapted as possible, but I need your help for that: IF
YOU KNOW OF ANY PLACE IN OOo THAT NEEDS TO BE ADAPTED, PLEASE LET ME KNOW.
Once all places have been identified, we can see how to address the task
of adapting them accordingly.
-Stephan
com/sun/star/accessibility/XAccessibleText: char getCharacter([in] long nIndex)
com/sun/star/awt/KeyEvent: char KeyChar
com/sun/star/awt/KeyStroke: char KeyChar
com/sun/star/awt/SimpleFontMetric: char FirstChar
com/sun/star/awt/SimpleFontMetric: char LastChar
com/sun/star/awt/XFont: sequence<short> getCharWidths([in] char nFirst, [in] char nLast)
com/sun/star/awt/XFont: short getCharWidth([in] char c)
com/sun/star/awt/XFont: void getKernPairs([out] sequence<char> Chars1, [out] sequence<char> Chars2, [out] sequence<short> Kerns)
com/sun/star/awt/XTextEditField: void setEchoChar([in] char cEcho)
com/sun/star/i18n/XExtendedInputSequenceChecker: long correctInputSequence([inout] string aText, [in] long nPos, [in] char cInputChar, [in] short nInputCheckMode)
com/sun/star/i18n/XExtendedTransliteration: char transliterateChar2Char([in] char cChar)
com/sun/star/i18n/XExtendedTransliteration: string transliterateChar2String([in] char cChar)
com/sun/star/i18n/XInputSequenceChecker: boolean checkInputSequence([in] string aText, [in] long nPos, [in] char cInputChar, [in] short nInputCheckMode)
com/sun/star/io/XDataInputStream: char readChar()
com/sun/star/io/XDataOutputStream: void writeChar([in] char Value)
com/sun/star/io/XTextInputStream: string readString([in] sequence<char> Delimiters, [in] boolean bRemoveDelimiter)
com/sun/star/style/TabStop: char DecimalChar
com/sun/star/style/TabStop: char FillChar
com/sun/star/test/bridge/TestSimple: char Char
com/sun/star/test/bridge/XBridgeTest2: sequence<char> setSequenceChar([in] sequence<char> aSeq)
com/sun/star/test/bridge/XBridgeTest2: void setSequencesInOut([inout] sequence<boolean> aSeqBoolean, [inout] sequence<char> aSeqChar, ...)
com/sun/star/test/bridge/XBridgeTest2: void setSequencesOut([out] sequence<boolean> aSeqBoolean, [out] sequence<char> aSeqChar, ...)
com/sun/star/test/bridge/XBridgeTestBase: [attribute] char Char
com/sun/star/test/bridge/XBridgeTestBase: com/sun/star/test/bridge/TestData getValues([out] boolean bBool, [out] char cChar, ...)
com/sun/star/test/bridge/XBridgeTestBase: com/sun/star/test/bridge/TestData setValues2([inout] boolean bBool, [inout] char cChar, ...)
com/sun/star/test/bridge/XBridgeTestBase: void setValues([in] boolean bBool, [in] char cChar, ...)
com/sun/star/test/performance/SimpleTypes: char Char
com/sun/star/text/TextSortDescriptor2: [property] char Delimiter
com/sun/star/text/TextSortDescriptor: [property] char Delimiter