Re: Unicode issues

J.Pietschmann Sun, 14 Jan 2007 14:06:00 -0800

Manuel Mall wrote:

2. Unicode text boundaries (UAX#29) especially word boundaries. Do we need this? It does not determine the word breaks to which the word spacing property is applied to as this is determined by the treat-as-word-space property. It could be used to determine the words for hyphenation.


Probably not worth the trouble. If the pattern-based hyphenator is used,
the words to hyphenate are determined by the character classes declared
there. Well, unless someone changes this.

3. Normalisation (UAX#15): Do we need this? Do we need to feed words in some normalised form to the hyphenation?


Yes, most definitely. Normalization is a major factor in keeping the
pattern-based hyphenator both reasonably robust and small for languages
which use combined characters.
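
To illustrate why normalization matters here, a minimal sketch: a lookup
table that only knows the precomposed spelling of a word still matches the
decomposed spelling once both sides go through NFC. NormalizedLookup and
the sample entry are made up for illustration and are not FOP code, and
java.text.Normalizer requires Java 6, which is an assumption on my part
and newer than the JDKs discussed in this thread.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a hyphenation table keyed by word.
final class NormalizedLookup {
    private final Map<String, String> table = new HashMap<>();

    // Store the entry under the NFC form of the word.
    void put(String word, String hyphenated) {
        table.put(Normalizer.normalize(word, Normalizer.Form.NFC), hyphenated);
    }

    // Normalize the query the same way before looking it up.
    String get(String word) {
        return table.get(Normalizer.normalize(word, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        NormalizedLookup lookup = new NormalizedLookup();
        lookup.put("caf\u00e9", "ca-f\u00e9");        // precomposed e-acute
        String decomposed = "cafe\u0301";             // e + combining acute
        System.out.println(decomposed.equals("caf\u00e9")); // false
        System.out.println(lookup.get(decomposed) != null); // true
    }
}
```

Without the normalize() calls the table would need one entry per spelling
variant, which is the kind of growth the patterns would otherwise suffer.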

Other uses for this?


Font selection in combination with character substitution. Ligatures
and character shaping.

There are libraries which already implement UAX#15 properly, e.g. icu4j,
but icu4j in particular is a rather large blob of a jar. I think Unicode
normalization should be handled like PDF encryption: do it if the
library is available, otherwise emit a warning and simply skip the
step.
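
A rough sketch of the "use it if it is there" approach, resolving the
normalizer by reflection so that nothing links against it at compile time.
OptionalNormalizer is a hypothetical name. The lookup assumes a class
shaped like java.text.Normalizer (a static normalize(CharSequence, Form)
method and a nested Form enum with an NFC constant); icu4j's own API is
shaped differently and would need its own lookup code.

```java
import java.lang.reflect.Method;

// Hypothetical wrapper: normalizes to NFC if the named class can be
// loaded, otherwise warns once and passes text through unchanged.
final class OptionalNormalizer {
    private Method normalizeMethod; // stays null when the library is missing
    private Object nfcForm;

    OptionalNormalizer(String className) {
        try {
            Class<?> normalizer = Class.forName(className);
            Class<?> form = Class.forName(className + "$Form");
            Method m = normalizer.getMethod("normalize", CharSequence.class, form);
            Object nfc = form.getField("NFC").get(null);
            this.normalizeMethod = m;
            this.nfcForm = nfc;
        } catch (ReflectiveOperationException | LinkageError e) {
            System.err.println("Warning: " + className
                    + " not available, text will not be normalized");
        }
    }

    boolean isAvailable() {
        return normalizeMethod != null;
    }

    String normalize(String text) {
        if (normalizeMethod == null) {
            return text; // skip the step, as with missing PDF encryption
        }
        try {
            return (String) normalizeMethod.invoke(null, text, nfcForm);
        } catch (ReflectiveOperationException e) {
            return text;
        }
    }

    public static void main(String[] args) {
        OptionalNormalizer n = new OptionalNormalizer("java.text.Normalizer");
        System.out.println(n.isAvailable());
    }
}
```

Passing a class name that does not exist exercises the fallback path: one
warning on stderr, and normalize() returns its input as is.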

Maybe BIDI can be done the same way, using the libraries from Java 1.4
onwards while keeping some 1.3 compatibility (just without BIDI).

4. Treatment of combining forms: What should / must we do with those character combinations?
5. Formatting control: Word joiners etc. These need at least be discarded and not given to the renderers.


This depends on the renderer. I'm not sure what PDF would do with them,
but RTF definitely should get them. While the RTF spec doesn't
mention anything about this topic, I'd say RTF visualization is done
using advanced renderers, which should take care of character shaping
etc. themselves.
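
As a sketch of making this a per-renderer decision: strip format controls
only for renderers that cannot deal with them. FormatControlFilter and the
rendererHandlesControls flag are hypothetical, and treating the whole
Unicode general category Cf as "formatting control" is my assumption. It
is a coarse choice, since Cf also covers the soft hyphen, so a real filter
would probably need a narrower list.

```java
// Hypothetical filter applied just before text is handed to a renderer.
final class FormatControlFilter {

    // Returns the text unchanged for renderers that handle format
    // controls themselves, otherwise drops every code point whose
    // general category is Cf (word joiner, zero width joiner, ...).
    static String prepare(String text, boolean rendererHandlesControls) {
        if (rendererHandlesControls) {
            return text;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.getType(cp) != Character.FORMAT) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String s = "a\u2060b"; // a, WORD JOINER, b
        System.out.println(prepare(s, false).length()); // 2
        System.out.println(prepare(s, true).length());  // 3
    }
}
```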

J.Pietschmann
