Manuel Mall wrote:
2. Unicode text boundaries (UAX#29), especially word boundaries. Do we
need this? It does not determine the word breaks to which the
word-spacing property is applied, as this is determined by the
treat-as-word-space property. It could be used to determine the words
Probably not worth the trouble. If the pattern-based hyphenator is used,
words for hyphenation are determined by the character classes declared
there. Well, unless someone changes this.
3. Normalisation (UAX#15): Do we need this? Do we need to feed words in
some normalised form to the hyphenator?
Yes, most definitely. Normalisation is a major factor in keeping the
pattern-based hyphenator both reasonably robust and small for languages
which use combined characters, since the patterns then only have to
cover one representation of each character.
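
To illustrate the point about feeding normalised words to the
hyphenator, here is a minimal sketch. It assumes java.text.Normalizer,
which only exists from Java 6 onwards and so is a stand-in for whatever
library would really be used; the choice of NFC and the class and
method names (HyphenationInput, normalizeForHyphenation) are
hypothetical and not taken from the hyphenator code.

```java
import java.text.Normalizer;

public class HyphenationInput {
    // Assumption: NFC is the form the hyphenation patterns are written in.
    // A base letter followed by a combining mark is composed into the
    // single precomposed character where one exists.
    static String normalizeForHyphenation(String word) {
        return Normalizer.normalize(word, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308ber"; // 'u' + COMBINING DIAERESIS + "ber"
        String composed = normalizeForHyphenation(decomposed);
        // Both spellings now map to the same key for pattern lookup.
        System.out.println(composed.equals("\u00FCber")); // prints "true"
    }
}
```

Without this step the patterns would need entries for both the
precomposed and the decomposed spelling of every such word part.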
Other uses for this?
Font selection in combination with character substitution. Ligatures
and character shaping.
There are libraries which already implement UAX#15 properly, e.g. icu4j,
but icu4j in particular is a rather large blob of a jar. I think Unicode
normalization should be handled like PDF encryption: do it if the
library is available, otherwise emit a warning and simply skip the
Maybe BIDI can be done the same way, using the libraries from Java 1.4
onwards while keeping some 1.3 compatibility (just without BIDI).
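
The "use it if the library is available" idea could look roughly like
the sketch below. It is only an illustration: the class name
OptionalNormalizer is hypothetical, the lookup goes through reflection
so that nothing links against icu4j at compile time, and it assumes an
icu4j version that provides com.ibm.icu.text.Normalizer2 with
getNFCInstance() and normalize(CharSequence). It also uses Java 7
syntax, so it does not address the 1.3 compatibility question.

```java
import java.lang.reflect.Method;

public class OptionalNormalizer {
    private static boolean warned = false;

    // Returns the NFC form of the text if icu4j can be loaded;
    // otherwise warns once and returns the text unchanged.
    static String normalize(String text) {
        try {
            Class<?> n2 = Class.forName("com.ibm.icu.text.Normalizer2");
            Object nfc = n2.getMethod("getNFCInstance").invoke(null);
            Method m = n2.getMethod("normalize", CharSequence.class);
            return (String) m.invoke(nfc, text);
        } catch (ReflectiveOperationException | LinkageError e) {
            // Library missing or incompatible: skip normalisation.
            if (!warned) {
                System.err.println(
                    "Warning: icu4j not usable, text is not normalised");
                warned = true;
            }
            return text;
        }
    }

    public static void main(String[] args) {
        // With icu4j on the classpath this composes the umlaut;
        // without it the input comes back as it went in.
        System.out.println(normalize("u\u0308ber").length());
    }
}
```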
4. Treatment of combining forms: What should / must we do with those
5. Formatting control: Word joiners etc. These need at least to be
discarded and not passed to the renderers.
This depends on the renderer. I'm not sure what PDF would do with it,
but RTF definitely should get them. While the RTF spec doesn't
mention anything about this topic, I'd say RTF visualization is done
using advanced renderers, which should take care of character shaping