Simon Pepping wrote:
> I think it is time to create a project for the hyphenation files at
> Sourceforge. The project should be a home for all sorts of accessories
> to FOP, or even to FO processors in general. Do you want to
> participate? Do you know a nice name?
Well, sf.net would appeal to a larger body of developers, I think,
and is certainly easier to manage for small projects, but we
can also ask on jakarta-commons, xml-commons and even declare it
a FOP (or XML graphics) subproject.
Anyway, I just uploaded
http://cvs.apache.org/~pietsch/t.tar.gz
which contains several unfinished pieces I produced over the last year:
- Utilities to generate tables for the Unicode line break property
- A class keeping a line break state according to TR14, which should
be easier to use than the java.text.BreakIterator for FOP
- A Java port of MySpell
- An attempt at providing a layered hierarchy for spell checking
and hyphenation interfaces.
- A Java port of the link grammar parser (incomplete, badly designed,
buggy and without approval of the original authors, *please* use
only for personal study, don't redistribute).
- An attempt at a morphological analyzer for German words.
Somehow, the simple port of patgen as well as other attempts at
simplifying the current FOP hyphenator are missing; I hope I
remember to upload them tomorrow.
If someone wants some problems to chew on:
- Implementation of an optimized trie or ternary or PATRICIA tree.
Issues here: The FOP implementation packs both tree construction and
retrieval into a single class, while the data structure is WORM.
Furthermore, while it is fast, it could be implemented with much
less memory, especially peak memory during construction. I ultimately
concluded that compiling the data into Java bytecode would be best.
Consider inserting the words WORD and WORM. A PATRICIA tree would
collapse this to
root: WOR -> leaf D
-> leaf M
In order to map this, the root node gets an operation "match string"
with the string "WOR" leading to the subtree. Statistical compression
could pick the best-suited operation for each node, such as "switch
array", "match 2-char string", "match 3-char string", "match n-char
string" etc. This may utilize BCEL.
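To make the WORD/WORM example concrete, here is a minimal sketch of a
path-compressed trie in plain Java. It only illustrates the "match
string" node idea; it is not the FOP implementation and does no
bytecode generation. The names CompressedTrie, Node, insert and
contains are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: a trie whose edges carry whole strings ("match string"). */
class CompressedTrie {
    static final class Node {
        String label;          // string that must match to enter this node
        boolean terminal;      // true if a stored word ends here
        final List<Node> children = new ArrayList<>();
        Node(String label, boolean terminal) {
            this.label = label;
            this.terminal = terminal;
        }
    }

    final Node root = new Node("", false);

    void insert(String word) {
        Node node = root;
        String rest = word;
        while (!rest.isEmpty()) {
            Node next = null;
            for (Node c : node.children) {
                if (c.label.charAt(0) == rest.charAt(0)) { next = c; break; }
            }
            if (next == null) {                 // nothing shared: new leaf
                node.children.add(new Node(rest, true));
                return;
            }
            int k = commonPrefix(next.label, rest);
            if (k < next.label.length()) {      // split: keep prefix, push tail down
                Node tail = new Node(next.label.substring(k), next.terminal);
                tail.children.addAll(next.children);
                next.children.clear();
                next.children.add(tail);
                next.label = next.label.substring(0, k);
                next.terminal = false;
            }
            rest = rest.substring(k);
            node = next;
        }
        node.terminal = true;
    }

    boolean contains(String word) {
        Node node = root;
        String rest = word;
        while (!rest.isEmpty()) {
            Node next = null;
            for (Node c : node.children) {
                if (rest.startsWith(c.label)) { next = c; break; }
            }
            if (next == null) return false;
            rest = rest.substring(next.label.length());
            node = next;
        }
        return node.terminal;
    }

    private static int commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    public static void main(String[] args) {
        CompressedTrie trie = new CompressedTrie();
        trie.insert("WORD");
        trie.insert("WORM");
        Node wor = trie.root.children.get(0);
        System.out.println(wor.label + " -> " + wor.children.size() + " leaves");
    }
}
```

After inserting WORD and WORM the root has a single child labelled
"WOR" with the two leaves "D" and "M". Construction and lookup still
live in one class here; a WORM variant would freeze the structure
after the build step.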
- Institutionalized alphabet transformation. This is somewhat of a
generalization of the hyphenation character classes. Java uses 16-bit
characters, but in many languages it is rare that more than 256
characters are actually used in words. TeX/PatGen also map the
characters onto the numbers 1..N (<256), folding character
classification into the process. Mapping chars onto bytes saves almost
half the memory. Because there are languages which require more than
256 characters, at least two implementations of the trie/whatever
holding the patterns are necessary, one where the keys are byte
sequences, another with char sequences. Too bad generics aren't ready
yet, but if the data is byte compiled into a Java class, the compiler
may analyze the patterns and decide whether bytes are sufficient.
Stuff like Unicode character normalization should probably be folded
into the classification/alphabet transformation too. It would be too
bad if hyphenation failed because someone decided to use unnormalized
characters like FI LIGATURE.
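A rough sketch of such an alphabet transformation, for the byte-keyed
case only. AlphabetMapper is a hypothetical name; using NFKC
normalization and lower-casing as the "classification" step is my
assumption for the example, not something the existing code does.
java.text.Normalizer is the standard JDK class (Java 6 and later).

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: folds normalization and case into a char -> byte mapping. */
final class AlphabetMapper {
    private final Map<Character, Byte> codes = new HashMap<>();

    /** Assigns codes 1..N to the letters; code 0 means "not in the alphabet". */
    AlphabetMapper(String letters) {
        for (char ch : normalize(letters).toCharArray()) {
            if (!codes.containsKey(ch)) {
                if (codes.size() >= 255) {
                    throw new IllegalArgumentException("alphabet does not fit into a byte");
                }
                // Codes above 127 are stored as negative bytes; read them with & 0xFF.
                codes.put(ch, (byte) (codes.size() + 1));
            }
        }
    }

    /** Assumed classification step: compatibility normalization plus lower-casing. */
    static String normalize(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC).toLowerCase(Locale.ROOT);
    }

    byte[] encode(String word) {
        String n = normalize(word);
        byte[] out = new byte[n.length()];
        for (int i = 0; i < n.length(); i++) {
            Byte code = codes.get(n.charAt(i));
            out[i] = (code == null) ? (byte) 0 : code.byteValue();
        }
        return out;
    }

    public static void main(String[] args) {
        AlphabetMapper latin = new AlphabetMapper("abcdefghijklmnopqrstuvwxyz");
        // "\uFB01" is LATIN SMALL LIGATURE FI; NFKC expands it to "fi".
        System.out.println(java.util.Arrays.toString(latin.encode("\uFB01sh")));
    }
}
```

With this, a word written with the FI ligature yields the same byte
sequence as the plain spelling, so pattern lookup does not depend on
how the input happened to be encoded. An alphabet with more than 255
letters is rejected here and would need the char-keyed variant.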
- API design. Need a hierarchy of interfaces which allow polymorphism
at various levels:
+ Hyphenator
implementations: pattern hyphenator, dictionary hyphenator,
composite hyphenator: delegate to a collection of child
hyphenators
+ Pattern hyphenator - pattern storage
implementations: HashTable (very easy to understand but slow),
R/W-trie, optimized WORM class, ...
+ Dictionary hyphenator - dictionary ...
For reuse in interactive applications, R/W storage may be useful (user
dictionaries)
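The top layer of that hierarchy could look roughly like the sketch
below. All names and signatures (Hyphenator, DictionaryHyphenator,
CompositeHyphenator, returning null for "no answer") are assumptions
for discussion, not an existing FOP API. The pattern hyphenator and
its storage layer are left out and stood in for by a lambda.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical top-level interface. */
interface Hyphenator {
    /** Returns break positions as indexes into word, or null if the word is unknown. */
    int[] hyphenate(String word);
}

/** R/W dictionary storage, as one would want for user dictionaries. */
final class DictionaryHyphenator implements Hyphenator {
    private final Map<String, int[]> entries = new HashMap<>();

    /** Adds an entry written with '-' at the break points, e.g. "hy-phen-a-tion". */
    void add(String hyphenated) {
        StringBuilder word = new StringBuilder();
        int[] breaks = new int[hyphenated.length()];
        int count = 0;
        for (char ch : hyphenated.toCharArray()) {
            if (ch == '-') {
                breaks[count++] = word.length();
            } else {
                word.append(ch);
            }
        }
        entries.put(word.toString(), Arrays.copyOf(breaks, count));
    }

    @Override
    public int[] hyphenate(String word) {
        int[] breaks = entries.get(word);
        return breaks == null ? null : breaks.clone();
    }
}

/** Delegates to its children in order; the first one with an answer wins. */
final class CompositeHyphenator implements Hyphenator {
    private final List<Hyphenator> children;

    CompositeHyphenator(List<Hyphenator> children) {
        this.children = children;
    }

    @Override
    public int[] hyphenate(String word) {
        for (Hyphenator child : children) {
            int[] breaks = child.hyphenate(word);
            if (breaks != null) return breaks;
        }
        return null;
    }
}

class HyphenatorDemo {
    public static void main(String[] args) {
        DictionaryHyphenator user = new DictionaryHyphenator();
        user.add("hy-phen-a-tion");
        Hyphenator neverBreak = word -> new int[0];   // stand-in for a pattern hyphenator
        Hyphenator h = new CompositeHyphenator(Arrays.asList(user, neverBreak));
        System.out.println(Arrays.toString(h.hyphenate("hyphenation")));
    }
}
```

Putting the user dictionary first in the composite lets exceptions
override whatever the pattern hyphenator would say.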
- Generalized line breaking strategies. Possible strategies
+ naive, break before the first non-space after a space
+ TR14
+ break before any character
+ pattern, regexp or dictionary based
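As a sketch of how the two simplest strategies might sit behind one
interface (LineBreakStrategy and breakOpportunities are hypothetical
names, and reporting opportunities as char offsets is an assumption):

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical strategy interface: offsets before which a line may be broken. */
interface LineBreakStrategy {
    List<Integer> breakOpportunities(CharSequence text);
}

/** Naive: break before the first non-space after a space. */
final class NaiveBreaks implements LineBreakStrategy {
    @Override
    public List<Integer> breakOpportunities(CharSequence text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < text.length(); i++) {
            if (text.charAt(i - 1) == ' ' && text.charAt(i) != ' ') {
                result.add(i);
            }
        }
        return result;
    }
}

/** Break before any character (works on chars, so it ignores surrogate pairs). */
final class AnyCharacterBreaks implements LineBreakStrategy {
    @Override
    public List<Integer> breakOpportunities(CharSequence text) {
        List<Integer> result = new ArrayList<>();
        for (int i = 1; i < text.length(); i++) {
            result.add(i);
        }
        return result;
    }
}

class LineBreakDemo {
    public static void main(String[] args) {
        LineBreakStrategy strategy = new NaiveBreaks();
        System.out.println(strategy.breakOpportunities("to be  or not"));
    }
}
```

A TR14 or pattern based strategy would implement the same interface,
so the layout code would not need to know which one is in use.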
- Other ideas: An API for processing the Unicode data files. An
optimized compiler from Unicode properties into Java class data:
select the properties you want and get them. Use this to get the
latest Unicode data into your Java applications rather than the
outdated stuff in the JRE.
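For the Unicode data file idea, the reading side might start out like
this sketch. It assumes the usual "code point or range ; value #
comment" layout of files such as LineBreak.txt. PropertyRange and
UcdParser are hypothetical names, and the step that compiles the
ranges into Java class data is not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical value class: an inclusive code point range with a property value. */
final class PropertyRange {
    final int first;
    final int last;
    final String value;

    PropertyRange(int first, int last, String value) {
        this.first = first;
        this.last = last;
        this.value = value;
    }
}

final class UcdParser {
    /** Parses lines like "0028;OP # comment" or "0030..0039 ; NU". */
    static List<PropertyRange> parse(Iterable<String> lines) {
        List<PropertyRange> out = new ArrayList<>();
        for (String raw : lines) {
            int hash = raw.indexOf('#');
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            if (line.isEmpty()) continue;          // comment or blank line
            String[] fields = line.split(";");
            if (fields.length < 2) continue;       // not a property line
            String range = fields[0].trim();
            int dots = range.indexOf("..");
            int first = Integer.parseInt(dots < 0 ? range : range.substring(0, dots), 16);
            int last = dots < 0 ? first : Integer.parseInt(range.substring(dots + 2), 16);
            out.add(new PropertyRange(first, last, fields[1].trim()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "# sample lines in the data file layout",
                "0028;OP # LEFT PARENTHESIS",
                "0030..0039;NU # digits");
        for (PropertyRange r : parse(sample)) {
            System.out.println(Integer.toHexString(r.first) + ".."
                    + Integer.toHexString(r.last) + " " + r.value);
        }
    }
}
```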
J.Pietschmann