Re: Hyphenation foundry [was: Re: proposed font project]

J.Pietschmann 16 Jun 2004 21:16:12 -0000

Simon Pepping wrote:

> I think it is time to create a project for the hyphenation files at
> Sourceforge. The project should be a home for all sorts of accessories
> to FOP, or even to FO processors in general. Do you want to
> participate? Do you know a nice name?


Well, sf.net would appeal to a larger body of developers, I think,
and is certainly easier to manage for small projects, but we
can also ask on jakarta-commons, xml-commons and even declare it
a FOP (or XML graphics) subproject.

Anyway, I just uploaded
 http://cvs.apache.org/~pietsch/t.tar.gz
which contains several unfinished pieces I produced over the last year:
- Utilities to generate tables for the Unicode line break property
- A class keeping a line break state according to TR14, which should
  be easier to use than the java.text.BreakIterator for FOP
- A Java port of MySpell
- An attempt at providing a layered hierarchy for spell checking
 and hyphenation interfaces.
- A Java port of the link grammar parser (incomplete, badly designed,
 buggy and without approval of the original authors, *please* use
 only for personal study, don't redistribute).
- An attempt at a morphological analyzer for German words.
Somehow, the simple port of patgen as well as other attempts at
simplifying the current FOP hyphenator are missing, I hope I
remember to upload them tomorrow.

If someone wants some problems to chew on:
- Implementation of an optimized trie or ternary or PATRICIA tree.
 Issues here: The FOP implementation packs both tree construction and
 retrieval into a single class, while the data structure is WORM
 (write once, read many). Furthermore, while it is fast, it could be
 implemented with much less memory, especially peak memory during
 construction. I ultimately concluded that compiling the data into
 Java bytecode would be best.
 Consider inserting the words WORD and WORM. A PATRICIA tree would
 collapse this to
   root: WOR -> leaf D
             -> leaf M
 In order to map this, the root node gets an operation "match string"
 with the string "WOR" leading to the subtree. Statistics on the data
 could be used to pick specialized operations, like "switch array",
 "match 2-char string", "match 3-char string", "match n-char string"
 etc. This may utilize BCEL.
- Institutionalized alphabet transformation. This is somewhat of a
 generalization of the hyphenation character classes. Java uses 16-bit
 characters, but in many languages it is rare that more than 256
 characters are actually used in words. TeX/PatGen also map the
 characters onto the numbers 1..N (<256), folding character
 classification into the process. Mapping chars onto bytes saves almost
 half the memory. Because there are languages which require more than
 256 characters, at least two implementations of the trie/whatever
 holding the patterns are necessary, one where the keys are byte
 sequences, another with char sequences. Too bad generics aren't ready
 yet, but if the data is byte compiled into a Java class, the compiler
 may analyze the patterns and decide whether bytes are sufficient.
 Stuff like Unicode character normalization should probably be folded
 into the classification/alphabet transformation too. It would be too
 bad if hyphenation failed because someone decided to use unnormalized
 characters like FI LIGATURE.
- API design. Need a hierarchy of interfaces which allows polymorphism
 at various levels:
  + Hyphenator
      implementations: pattern hyphenator, dictionary hyphenator,
      composite hyphenator: delegate to a collection of child
      hyphenators
  + Pattern hyphenator - pattern storage
     implementations: HashTable (very easy to understand but slow),
     R/W-trie, optimized WORM class, ...
  + Dictionary hyphenator - dictionary ...
 For reuse in interactive applications, R/W storage may be useful (user
 dictionaries)
- Generalized line breaking strategies. Possible strategies
 + naive, break before the first non-space after a space
 + TR14
 + break before any character
 + pattern, regexp or dictionary based
- Other ideas: API for processing the Unicode data files. Optimized
 compilation of Unicode properties into Java class data: select the
 properties you want, get them. Use this to get the latest Unicode data
 into your Java applications rather than the outdated stuff in the
 JRE.
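
To make the WORD/WORM example from the trie item concrete, here is a
minimal sketch of a compressed (PATRICIA-style) trie in Java. It only
shows the plain in-memory structure with edge splitting. It is not the
FOP implementation, and it does not attempt the bytecode compilation or
the "match string" operations discussed there. The names
RadixTrieSketch, insert, contains and dump are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Compressed trie sketch: each edge carries a whole string, not a single char. */
final class RadixTrieSketch {
    private static final class Node {
        String label;      // characters consumed on the edge leading into this node
        boolean terminal;  // true if a stored word ends exactly here
        final List<Node> children = new ArrayList<>();
        Node(String label, boolean terminal) {
            this.label = label;
            this.terminal = terminal;
        }
    }

    private final Node root = new Node("", false);

    void insert(String word) {
        Node node = root;
        String rest = word;
        while (!rest.isEmpty()) {
            Node match = null;
            int common = 0;
            for (Node child : node.children) {
                common = commonPrefix(child.label, rest);
                if (common > 0) { match = child; break; }
            }
            if (match == null) {                 // nothing shared: new leaf
                node.children.add(new Node(rest, true));
                return;
            }
            if (common < match.label.length()) { // partial overlap: split the edge
                Node tail = new Node(match.label.substring(common), match.terminal);
                tail.children.addAll(match.children);
                match.children.clear();
                match.children.add(tail);
                match.label = match.label.substring(0, common);
                match.terminal = false;
            }
            node = match;
            rest = rest.substring(common);
        }
        node.terminal = true;
    }

    boolean contains(String word) {
        Node node = root;
        String rest = word;
        while (!rest.isEmpty()) {
            Node next = null;
            for (Node child : node.children) {
                if (rest.startsWith(child.label)) { next = child; break; }
            }
            if (next == null) return false;
            rest = rest.substring(next.label.length());
            node = next;
        }
        return node.terminal;
    }

    /** Renders the tree as nested labels; '*' marks the end of a stored word. */
    String dump() { return dump(root); }

    private static String dump(Node n) {
        StringBuilder sb = new StringBuilder(n.label);
        if (n.terminal) sb.append('*');
        if (!n.children.isEmpty()) {
            sb.append('(');
            for (int i = 0; i < n.children.size(); i++) {
                if (i > 0) sb.append(',');
                sb.append(dump(n.children.get(i)));
            }
            sb.append(')');
        }
        return sb.toString();
    }

    private static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }
}
```

Inserting WORD and then WORM leaves a single edge WOR under the root
with the two leaves D and M; dump() renders that as (WOR(D*,M*)).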
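
The alphabet transformation item can be sketched the same way. The
class below maps each letter of a language onto a byte code 1..N, with
0 reserved for characters outside the alphabet, and folds case and
Unicode compatibility normalization into the mapping so that a word
typed with the FI LIGATURE encodes like the plain spelling. This is
only an illustration under some assumptions: AlphabetSketch and encode
are hypothetical names, NFKC is one possible choice of normalization,
and the code needs a JDK that ships java.text.Normalizer and generics
(Java 6 or later).

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Maps the letters of one language onto byte codes 1..N (0 = not in alphabet). */
final class AlphabetSketch {
    private final Map<Character, Byte> codes = new HashMap<>();

    AlphabetSketch(String letters) {
        for (char c : letters.toCharArray()) {
            if (codes.containsKey(c)) continue;
            if (codes.size() == 255) {
                // Such a language needs the char-keyed variant of the pattern store.
                throw new IllegalArgumentException("more than 255 letters: use char keys");
            }
            codes.put(c, (byte) (codes.size() + 1));
        }
    }

    /** Normalizes (NFKC), lowercases, then replaces every char by its byte code. */
    byte[] encode(String word) {
        String folded = Normalizer.normalize(word, Normalizer.Form.NFKC)
                                  .toLowerCase(Locale.ROOT);
        byte[] out = new byte[folded.length()];
        for (int i = 0; i < folded.length(); i++) {
            Byte code = codes.get(folded.charAt(i));
            out[i] = (code == null) ? (byte) 0 : code.byteValue();
        }
        return out;
    }
}
```

With the 26 lowercase Latin letters as the alphabet, both "Fish" and a
spelling that starts with U+FB01 (LATIN SMALL LIGATURE FI) encode to
the same four bytes 6, 9, 19, 8. Codes above 127 come out as negative
Java bytes, which is harmless as long as they are only used as keys.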
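
For the API design item, one possible shape of the layered interfaces
is sketched below: a top-level Hyphenator, a pattern hyphenator that
talks to its pattern storage only through an interface, a read/write
dictionary hyphenator, and a composite that delegates to child
hyphenators. All type and method names are invented for this sketch
and are not taken from FOP or from the uploaded tarball. The storage
shown is the easy-to-understand hash table variant, the lookup loop is
a plain Liang-style pattern match without minimum prefix or suffix
lengths, and "first child with an answer wins" is just one way a
composite could combine its children.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Top level: anything that can propose break offsets for a word. */
interface Hyphenator {
    /** Returns break offsets into the word, or null if this hyphenator has no answer. */
    int[] hyphenate(String word);
}

/** Second level: pattern lookup is separated from the hyphenation algorithm. */
interface PatternStorage {
    /** Returns the inter-letter values for a pattern's letters, or null if unknown. */
    int[] lookup(String letters);
}

/** Hash table storage; a trie or compiled WORM class could replace it. */
final class HashPatternStorage implements PatternStorage {
    private final Map<String, int[]> values = new HashMap<>();

    /** Accepts TeX-style patterns such as "hy3ph". */
    void add(String pattern) {
        StringBuilder letters = new StringBuilder();
        int[] tmp = new int[pattern.length() + 1];
        for (char c : pattern.toCharArray()) {
            if (Character.isDigit(c)) tmp[letters.length()] = c - '0';
            else letters.append(c);
        }
        values.put(letters.toString(), Arrays.copyOf(tmp, letters.length() + 1));
    }

    public int[] lookup(String letters) { return values.get(letters); }
}

/** Liang-style pattern hyphenator working against any PatternStorage. */
final class PatternHyphenator implements Hyphenator {
    private final PatternStorage storage;
    PatternHyphenator(PatternStorage storage) { this.storage = storage; }

    public int[] hyphenate(String word) {
        String padded = "." + word + ".";
        int[] scores = new int[padded.length() + 1];
        for (int i = 0; i < padded.length(); i++) {
            for (int j = i + 1; j <= padded.length(); j++) {
                int[] v = storage.lookup(padded.substring(i, j));
                if (v == null) continue;
                for (int k = 0; k < v.length; k++) {
                    scores[i + k] = Math.max(scores[i + k], v[k]);
                }
            }
        }
        // scores[p + 1] sits just before word.charAt(p); odd values allow a break.
        List<Integer> breaks = new ArrayList<>();
        for (int p = 1; p < word.length(); p++) {
            if (scores[p + 1] % 2 == 1) breaks.add(p);
        }
        int[] result = new int[breaks.size()];
        for (int n = 0; n < result.length; n++) result[n] = breaks.get(n);
        return result;
    }
}

/** Read/write dictionary, e.g. for user entries in interactive applications. */
final class DictionaryHyphenator implements Hyphenator {
    private final Map<String, int[]> entries = new HashMap<>();

    /** Takes a pre-hyphenated entry such as "ta-ble". */
    void add(String hyphenated) {
        String[] parts = hyphenated.split("-");
        int[] breaks = new int[parts.length - 1];
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            word.append(parts[i]);
            if (i < parts.length - 1) breaks[i] = word.length();
        }
        entries.put(word.toString(), breaks);
    }

    public int[] hyphenate(String word) { return entries.get(word); }
}

/** Delegates to its children in order; the first non-null answer wins. */
final class CompositeHyphenator implements Hyphenator {
    private final List<Hyphenator> children;
    CompositeHyphenator(List<Hyphenator> children) { this.children = children; }

    public int[] hyphenate(String word) {
        for (Hyphenator child : children) {
            int[] result = child.hyphenate(word);
            if (result != null) return result;
        }
        return null;
    }
}
```

Loaded with the nine patterns hy3ph, he2n, hena4, hen5at, 1na, n2at,
1tio, 2io and o2n, the pattern hyphenator returns the offsets 2 and 6
for "hyphenation", i.e. hy-phen-ation. Putting a dictionary in front
of it in a composite lets user entries override the patterns.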


J.Pietschmann
