Re: Unicode issues

Manuel Mall Sun, 14 Jan 2007 20:42:12 -0800

On Monday 15 January 2007 07:05, J.Pietschmann wrote:
> Manuel Mall wrote:
> > 2. Unicode text boundaries (UAX#29), especially word boundaries. Do
> > we need this? It does not determine the word breaks to which the
> > word spacing property is applied, as this is determined by the
> > treat-as-word-space property. It could be used to determine the
> > words for hyphenation.
>
> Probably not worth the trouble. If the pattern based hyphenator is
> used, words for hyphenation are determined by the character classes
> declared there. Well, unless someone changes this.
>
> > 3. Normalisation (UAX#15): Do we need this? Do we need to feed
> > words in some normalised form to the hyphenation?
>
> Yes, most definitely. This is a major factor in keeping the pattern
> based hyphenator both reasonably robust and small for languages which
> use combined characters, obviously.
>
> > Other uses for this?
>
> Font selection in combination with character substitution. Ligatures
> and character shaping.


Joerg, can you elaborate on this for me, please? I am in no way an
expert on Unicode, and any hints are useful. To keep it simple (for
us :-)), let's stay with German as an example. In Unicode an 'umlaut'
can be represented as one or two code points. What, in your opinion,
should FOP do when it encounters a code point which can be split into
two, or vice versa?
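
To make the question concrete, here is a minimal sketch of the two
representations and of converting between them. It assumes the
java.text.Normalizer class (available from Java 6 on, so newer than the
1.3/1.4 targets discussed in this thread); the class and method names
UmlautNormalization, compose and decompose are made up for illustration
and are not FOP code.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of FOP: shows the composed (NFC) and
// decomposed (NFD) forms of a German umlaut.
final class UmlautNormalization {

    // NFC: "a" + U+0308 (combining diaeresis) becomes the single code point U+00E4.
    static String compose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    // NFD: U+00E4 is split into "a" + U+0308.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String precomposed = "M\u00E4dchen";  // a-umlaut as one code point
        String decomposed = "Ma\u0308dchen";  // a-umlaut as two code points

        // The two strings look the same when rendered but are not equal as char sequences.
        System.out.println(precomposed.equals(decomposed));            // false
        System.out.println(compose(decomposed).equals(precomposed));   // true
        System.out.println(decompose(precomposed).equals(decomposed)); // true
    }
}
```

If hyphenation patterns are written against one of the two forms, the
words handed to the hyphenator would have to be brought into that same
form first; which form that should be is exactly the open question.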

>
> There are libraries which already implement UAX#15 properly, e.g.
> icu4j, but especially icu4j is a rather large blob of a jar. I think
> Unicode normalization should be handled like PDF encryption: do it if
> the library is available, otherwise emit a warning and simply skip
> the step.
>
> Maybe BIDI can be done the same way, using the Java 1.4ff libs while
> keeping some 1.3 compatibility (just without BIDI).
>
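
The "use it if it is there, otherwise warn and skip" idea quoted above
could look roughly like the sketch below. It is only an illustration:
the class OptionalNormalization and its methods are hypothetical, and
the ICU4J entry point (com.ibm.icu.text.Normalizer with a static
compose(String, boolean) method) is an assumption that should be
checked against the ICU4J version actually used. The library is called
via reflection so that the code compiles and runs without the jar.

```java
import java.lang.reflect.Method;
import java.util.logging.Logger;

// Hypothetical sketch, not FOP code: normalize only if ICU4J is on the classpath.
final class OptionalNormalization {
    private static final Logger LOG =
            Logger.getLogger(OptionalNormalization.class.getName());

    // Assumed ICU4J class name; adjust if the library differs.
    static final String ICU_NORMALIZER = "com.ibm.icu.text.Normalizer";

    // Returns true if the named class can be loaded.
    static boolean isAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns the composed form if the library is present, else the input unchanged.
    static String normalizeIfAvailable(String text) {
        if (!isAvailable(ICU_NORMALIZER)) {
            LOG.warning("ICU4J not found, skipping Unicode normalization");
            return text;
        }
        try {
            // Assumed signature: static String compose(String str, boolean compat)
            Method compose = Class.forName(ICU_NORMALIZER)
                    .getMethod("compose", String.class, boolean.class);
            return (String) compose.invoke(null, text, Boolean.FALSE);
        } catch (ReflectiveOperationException | LinkageError e) {
            LOG.warning("Unicode normalization failed, skipping: " + e);
            return text;
        }
    }
}
```

The same availability check could guard BIDI support, with the text
passed through untouched when the required classes are missing.
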
> > 4. Treatment of combining forms: What should / must we do with
> > those character combinations?
> >
> > 5. Formatting control: Word joiners etc. These need at least to be
> > discarded and not given to the renderers.
>
> This depends on the renderer. I'm not sure what PDF would do with it,
> but RTF definitely should get them. While the RTF spec doesn't
> mention anything about this topic, I'd say RTF visualization is done
> using advanced renderers, which should take care of character shaping
> etc. itself.

RTF is a special-case renderer: it bypasses layout, so it wouldn't be
affected in any way by what I suggest, as these are layout functions.
I noticed, for example, that PDF prints a # for a word joiner. That's
why I thought that most Cf (format control) code points should be
dealt with in layout and not be passed to the renderers.
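
As a rough sketch of what "not passing Cf code points to the renderers"
could mean in code, the following drops every code point whose general
category is Cf. The class FormatControlFilter and its method are
hypothetical, and the blanket removal is a simplification: some Cf
characters (soft hyphen, zero width joiner and non-joiner) carry
information that layout would have to act on before they are discarded.

```java
// Hypothetical sketch, not FOP code: remove general category Cf code points.
final class FormatControlFilter {

    static String stripFormatControls(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            // Character.FORMAT is Java's constant for Unicode category Cf.
            if (Character.getType(cp) != Character.FORMAT) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String withJoiner = "foo\u2060bar"; // U+2060 WORD JOINER, category Cf
        System.out.println(stripFormatControls(withJoiner)); // foobar
    }
}
```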

>
> J.Pietschmann

Manuel
