On Monday 15 January 2007 07:05, J.Pietschmann wrote:
> Manuel Mall wrote:
> > 2. Unicode text boundaries (UAX#29) especially word boundaries. Do
> > we need this? It does not determine the word breaks to which the
> > word spacing property is applied to as this is determined by the
> > treat-as-word-space property. It could be used to determine the
> > words for hyphenation.
>
> Probably not worth the trouble. If the pattern based hyphenator is
> used, words for hyphenation are determined by the character classes
> declared there. Well, unless someone changes this.
>
> > 3. Normalisation (UAX#15): Do we need this? Do we need to feed
> > words in some normalised form to the hyphenation.
>
> Yes, most definitely. This is a major factor in keeping the pattern
> based hyphenator both reasonably robust and small for languages which
> use combined characters, obviously.
>
> > Other uses for this?
>
> Font selection in combination with character substitution. Ligatures
> and character shaping.

Joerg, can you elaborate on this for me, please? I am in no way an
expert on Unicode, and any hints are useful. To keep it simple (for
us :-)), let's stay with German as an example. In Unicode an 'umlaut'
can be represented as 1 or 2 code points. What, in your opinion, should
FOP do: split a single code point into two, or combine two into one?

>
> There are libraries which already implement UAX#15 properly, e.g.
> icu4j, but especially icu4j is a rather large blob of a jar. I think
> Unicode normalization should be handled like PDF encryption: do it if
> the library is available, otherwise emit a warning and simply skip
> the step.
>
> Maybe BIDI can be done the same way, using the Java 1.4ff libs while
> keeping some 1.3 compatibility (just without BIDI).
>
> > 4. Treatment of combining forms: What should / must we do with
> > those character combinations?
> >
> > 5. Formatting control: Word joiners etc.. These need at least be
> > discarded and not given to the renderers.
>
> This depends on the renderer. I'm not sure what PDF would do with it,
> but RTF definitely should get them. While the RTF spec doesn't
> mention anything about this topic, I'd say RTF visualization is done
> using advanced renderers, which should take care of character shaping
> etc. itself.

RTF is a special-case renderer: it bypasses layout, so it wouldn't be
affected in any way by what I suggest, as these are layout functions.
I noticed that PDF prints a # for a word joiner, for example. That's
why I thought that most Cf code points should be dealt with in layout
and not be passed to the renderers.

>
> J.Pietschmann

Manuel
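
To make the two questions above concrete (one or two code points for a
German umlaut, and keeping Cf code points such as the word joiner away
from the renderers), here is a minimal Java sketch. It is not FOP code:
the class TextPrep and its method names are hypothetical, and it assumes
java.text.Normalizer, which only exists from Java 6 on, so it would not
satisfy the 1.3/1.4 compatibility mentioned in the thread. Dropping every
Cf code point is also a deliberate simplification; the soft hyphen
(U+00AD) and the zero width joiner (U+200D) are Cf as well, and layout
would probably want to act on those before discarding them.

```java
import java.text.Normalizer;

final class TextPrep {

    /** Compose to NFC, so that 'u' followed by U+0308 becomes the single code point U+00FC. */
    public static String compose(String word) {
        return Normalizer.normalize(word, Normalizer.Form.NFC);
    }

    /** Drop all code points of general category Cf, for example U+2060 WORD JOINER. */
    public static String stripFormatChars(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.getType(cp) != Character.FORMAT) {
                sb.appendCodePoint(cp);
            }
            // Advance by one or two chars, depending on whether cp is supplementary.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String decomposed = "u\u0308ber";            // 5 chars: u + combining diaeresis + "ber"
        String composed = compose(decomposed);       // 4 chars, starting with U+00FC
        System.out.println(decomposed.length() + " -> " + composed.length());
        System.out.println(stripFormatChars("foo\u2060bar"));
    }
}
```

Composing to NFC before looking a word up in the hyphenation patterns
means the patterns only need the precomposed form of each umlaut, which
is one way to read the "robust and small" remark quoted above. Whether
NFC or NFD is the better choice for FOP is exactly the open question.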
