Summary: New hyphenation patterns
Created an attachment (id=24069)
classes for hyphenation, generated from UnicodeData.txt
The TeX people are now moving to Unicode-based TeX engines. Therefore they
created new hyphenation pattern files in UTF-8 encoding, see
These pattern files can be directly transformed into XML format and used in
FOP. I tested a few, and had no problems.
They lack one thing, however: classes. FOP uses classes to determine what is a
letter (only words consisting of letters will be hyphenated) and the LC/UC
mapping. TeX gets the classes from its Unicode setup, see e.g.
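
To illustrate the two roles of classes described above, here is a minimal
sketch in Python. It is not FOP's actual code; the function names
build_class_map and normalize_word are hypothetical, and I assume that each
class is a string whose first character is the lowercase representative.

```python
from typing import Dict, Iterable, Optional


def build_class_map(classes: Iterable[str]) -> Dict[str, str]:
    """Map every member of a class to the first character of that class.

    Assumption: the first character of each class is the lowercase form.
    """
    class_map: Dict[str, str] = {}
    for cls in classes:
        representative = cls[0]
        for ch in cls:
            class_map[ch] = representative
    return class_map


def normalize_word(word: str, class_map: Dict[str, str]) -> Optional[str]:
    """Return the word mapped to class representatives, or None.

    None means the word contains a character that is in no class, so it
    is not treated as a word of letters and would not be hyphenated.
    """
    normalized = []
    for ch in word:
        if ch not in class_map:
            return None
        normalized.append(class_map[ch])
    return "".join(normalized)
```

With the classes "aA" and "bB", the word "Ab" would be looked up as "ab",
while "a1" would be rejected because "1" belongs to no class.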
I have tried to do the same, and I attach the result. These classes would be
valid for each hyphenation pattern file. Some localizations seem to have their
own variants of the LC/UC mapping, but I have not investigated that.
The classes were generated as follows: Roughly, each character that is its own
LC generates a class. Its UC and TC (title case character) are added to the
class. More precisely, a character generates a class only if it meets all of
the following criteria:
1. It is in the first plane (the Basic Multilingual Plane),
2. It has category Ll, Lu or Lt and is its own LC character, or has category Lo,
3. It is not in any of the following blocks: Superscripts and Subscripts, Letterlike
Symbols, Alphabetic Presentation Forms, Halfwidth and Fullwidth Forms, CJK
Unified Ideographs, CJK Unified Ideographs Extension A, Hangul Syllables.
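
The selection rules above can be sketched roughly as follows in Python. This
is not the script that produced the attachment; it is a reconstruction under
assumptions: the name generate_classes is hypothetical, the input is the
standard semicolon-separated UnicodeData.txt format (field 2 is the category,
fields 12 to 14 are the simple UC, LC and TC mappings), an empty LC field is
taken to mean that the character is its own LC, and the block ranges are
those of the Unicode block list.

```python
from typing import Iterable, List

# Blocks whose characters never generate a class (inclusive code point ranges).
EXCLUDED_BLOCKS = [
    (0x2070, 0x209F),  # Superscripts and Subscripts
    (0x2100, 0x214F),  # Letterlike Symbols
    (0xFB00, 0xFB4F),  # Alphabetic Presentation Forms
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0xAC00, 0xD7AF),  # Hangul Syllables
]


def in_excluded_block(cp: int) -> bool:
    return any(first <= cp <= last for first, last in EXCLUDED_BLOCKS)


def generate_classes(lines: Iterable[str]) -> List[str]:
    """Build one class string per qualifying character of UnicodeData.txt.

    Each class starts with the generating character, followed by its UC
    and TC characters when they exist and differ.
    """
    classes: List[str] = []
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 15:
            continue  # blank or malformed line
        cp = int(fields[0], 16)
        category = fields[2]
        upper, lower, title = fields[12], fields[13], fields[14]

        if cp > 0xFFFF:  # rule 1: first plane only
            continue
        if in_excluded_block(cp):  # rule 3: excluded blocks
            continue
        # Rule 2: cased letter that is its own LC, or category Lo.
        own_lc = lower == "" or int(lower, 16) == cp
        if not ((category in ("Ll", "Lu", "Lt") and own_lc) or category == "Lo"):
            continue

        members = [chr(cp)]
        for mapped in (upper, title):
            if mapped:
                ch = chr(int(mapped, 16))
                if ch not in members:
                    members.append(ch)
        classes.append("".join(members))
    return classes
```

Called as generate_classes(open("UnicodeData.txt", encoding="ascii")), this
would return one string per class, for example "aA" for the Latin letter a.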
We can do two things: Add these classes to each hyphenation file, or add them
to the code that generates the hyphenation trie, preferably to be read from a
separate file. I prefer the latter option. What do you think?