https://issues.apache.org/bugzilla/show_bug.cgi?id=47610

           Summary: New hyphenation patterns
           Product: Fop
           Version: all
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: normal
          Priority: P2
         Component: general
        AssignedTo: fop-dev@xmlgraphics.apache.org
        ReportedBy: spepp...@apache.org


Created an attachment (id=24069)
 --> (https://issues.apache.org/bugzilla/attachment.cgi?id=24069)
classes for hyphenation, generated from UnicodeData.txt

The TeX people are now moving to Unicode-based TeX engines. They have therefore
created new hyphenation pattern files in UTF-8 encoding; see
http://www.ctan.org/tex-archive/language/hyph-utf8/tex/generic/hyph-utf8/patterns/
and
http://tug.org/svn/texhyphen/trunk/hyph-utf8/tex/generic/hyph-utf8/patterns/.
These pattern files can be transformed directly into XML format and used in
FOP. I tested a few and had no problems.
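
The transformation mentioned above could be sketched roughly as follows. This
is only an illustration, not the conversion actually used for the tests. It
assumes the TeX file carries its data in \patterns{...} and \hyphenation{...}
groups, and that the target is FOP's hyphenation XML layout with the elements
hyphenation-info, classes, exceptions and patterns. The function names
tex_block and to_fop_xml are hypothetical.

```python
import re
from xml.sax.saxutils import escape


def tex_block(source, macro):
    """Return the whitespace-separated entries of \\macro{...}, or []."""
    # Drop TeX comments (from '%' to end of line) before searching.
    text = re.sub(r"%.*", "", source)
    match = re.search(r"\\" + macro + r"\s*\{([^}]*)\}", text)
    return match.group(1).split() if match else []


def to_fop_xml(source, classes=()):
    """Wrap TeX patterns and exceptions in an XML skeleton.

    Assumption: element names and order follow FOP's hyphenation XML
    layout; check them against the DTD shipped with FOP before use.
    """
    parts = ['<?xml version="1.0" encoding="UTF-8"?>', "<hyphenation-info>"]
    parts.append("<classes>")
    parts += [escape(c) for c in classes]
    parts.append("</classes>")
    parts.append("<exceptions>")
    parts += [escape(e) for e in tex_block(source, "hyphenation")]
    parts.append("</exceptions>")
    parts.append("<patterns>")
    parts += [escape(p) for p in tex_block(source, "patterns")]
    parts.append("</patterns>")
    parts.append("</hyphenation-info>")
    return "\n".join(parts) + "\n"
```

The classes argument is left empty by default because the TeX files do not
contain that information.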

They lack one thing, however: classes. FOP uses classes to determine what is a
letter (only words consisting of letters will be hyphenated) and to obtain the
lowercase/uppercase (LC/UC) mapping. TeX gets the classes from its Unicode
setup; see e.g.
http://scripts.sil.org/svn-public/xetex/TRUNK/texmf/tex/generic/xetex/unicode-letters.tex.
I have tried to do the same, and I attach the result. These classes would be
valid for every hyphenation pattern file. Some localizations seem to have their
own variants of the LC/UC mapping, but I have not investigated that.
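
To illustrate the two roles of the classes described above, here is a minimal
sketch. It is not FOP's code. It assumes that each class is a string whose
first character is the representative (the LC form) to which all other members
are mapped; build_class_map and normalize are hypothetical names.

```python
def build_class_map(classes):
    """Map every character of every class to the class's first character."""
    mapping = {}
    for cls in classes:
        for ch in cls:
            # Assumption: the first class that mentions a character wins.
            mapping.setdefault(ch, cls[0])
    return mapping


def normalize(word, mapping):
    """Return the case-normalized word, or None if it contains a non-letter.

    A character counts as a letter only if it appears in some class, so a
    word with any other character is not a candidate for hyphenation.
    """
    if all(ch in mapping for ch in word):
        return "".join(mapping[ch] for ch in word)
    return None
```

With the classes "aA" and "bB", the word "Ab" would be normalized to "ab",
while "A1" would be rejected because "1" belongs to no class.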

The classes were generated as follows. Roughly, each character that is its own
LC generates a class, and its UC and TC (title case character) are added to
that class. More precisely, a character generates a class if it meets all of
the following conditions:
1. It is in the first plane.
2. Its category is Ll, Lu or Lt and it is its own LC character, or its
category is Lo.
3. It is not in one of the following blocks: Superscripts and Subscripts,
Letterlike Symbols, Alphabetic Presentation Forms, Halfwidth and Fullwidth
Forms, CJK Unified Ideographs, CJK Unified Ideographs Extension A, Hangul
Syllables.
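
The selection rules above can be sketched as follows. This is a reconstruction
for illustration, not the script that produced the attachment. It assumes that
"first plane" means code points up to U+FFFF, that a character with an empty
lowercase field in UnicodeData.txt is its own LC, and that the block ranges
are those given in the Unicode Blocks.txt file. The name build_classes is
hypothetical.

```python
# Blocks excluded by rule 3 (inclusive code point ranges).
EXCLUDED_BLOCKS = [
    (0x2070, 0x209F),  # Superscripts and Subscripts
    (0x2100, 0x214F),  # Letterlike Symbols
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xFB00, 0xFB4F),  # Alphabetic Presentation Forms
    (0xFF00, 0xFFEF),  # Halfwidth and Fullwidth Forms
]


def build_classes(lines):
    """Build one class string per selected character.

    `lines` are lines of UnicodeData.txt: 15 semicolon-separated fields,
    with the category in field 2 and the simple uppercase, lowercase and
    titlecase mappings in fields 12, 13 and 14.
    """
    classes = []
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 15:
            continue
        cp = int(fields[0], 16)
        category, upper, lower, title = fields[2], fields[12], fields[13], fields[14]
        # Rule 1 (assumed to mean code points up to U+FFFF) and rule 3.
        if cp > 0xFFFF or any(lo <= cp <= hi for lo, hi in EXCLUDED_BLOCKS):
            continue
        # Rule 2: a cased letter that is its own LC, or an "other letter".
        own_lc = lower == "" or int(lower, 16) == cp
        if not (category == "Lo" or (category in ("Ll", "Lu", "Lt") and own_lc)):
            continue
        # The class starts with the character itself, then its UC and TC.
        members = [cp]
        for mapped in (upper, title):
            if mapped and int(mapped, 16) not in members:
                members.append(int(mapped, 16))
        classes.append("".join(chr(m) for m in members))
    return classes
```

It could be run over the whole file with
build_classes(open("UnicodeData.txt", encoding="utf-8")).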

We can do two things: Add these classes to each hyphenation file, or add them
to the code that generates the hyphenation trie, preferably to be read from a
separate file. I prefer the latter option. What do you think?

