[lingu-dev] Hyphen 2.4: hyphenmin, compound hyphenation, improved en_US hyphenation patterns

Németh László Wed, 07 May 2008 03:21:52 -0700

Hi,

The new version of the Hyphen hyphenator adds default hyphenmin
settings, optional compound word hyphenation support, and improved
en_US hyphenation patterns.


The Hyphen hyphenator (standalone version of OpenOffice.org ALTLinux
Libhnj) is the default hyphenator of OpenOffice.org on several
platforms (Debian, Fedora, Ubuntu). Integration with OpenOffice.org
(also the improved hyphenation patterns) is under development.

Source distribution: http://downloads.sourceforge.net/hunspell/hyphen-2.4.tar.gz

Release notes:

2008-05-01 Hyphen 2.4 release:
  - compound word hyphenation support by recursive pattern matching
    based on two hyphenation pattern sets, see README.compound.
    Especially useful for languages with arbitrary number of compounds (Danish,
    Dutch, Finnish, German, Hungarian, Icelandic, Norwegian, Swedish etc.).

  - new dictionary parameters (minimal character numbers for hyph. distances):
    LEFTHYPHENMIN: minimal hyphenation distance from the left end of the word
    RIGHTHYPHENMIN: minimal hyphenation distance from the right end of the word
    COMPOUNDLEFTHYPHENMIN: min. hyph. dist. from the left compound word boundary
    COMPOUNDRIGHTHYPHENMIN: min. hyph. dist. from the right comp. word boundary

  - new API function: hnj_hyphen_hyphenate3() (like hyphenate2(), but
    with hyphenmin options)

en_US hyphenation patterns:

  - extended hyph_en_US.dic with the TUGboat hyphenation log (fixes a thousand
    incompletely or badly hyphenated words, for example acad-e-my, acro-nym,
    acryl-amide, adren-a-line, aero-space, am-phet-a-mine, anom-aly etc.)

  - fixed hyph_en_US.dic: set the right default hyphenation distance of
    the original TeX hyphenation patterns:
    LEFTHYPHENMIN 2
    RIGHTHYPHENMIN 3 (not 2!)
    This is not only a typographical issue. It seems the TeX hyphenation
    patterns are correct only with these settings: for example,
    the bad break "anoma-ly" is prevented in TeX only by the default
    \righthyphenmin=3 (and was not prevented in OpenOffice.org until now).

  - documentation (README_hyph_en_US.dic)

  - fixes for automake configuration, compiling and checking, see ChangeLog
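
To illustrate the RIGHTHYPHENMIN fix above, here is a minimal Python
sketch of how the two distances restrict candidate break points. The
function name apply_hyphenmin and the hand-written candidate list are
assumptions made for this example only; they are not part of the Hyphen
API, and the candidates are not real pattern output.

```python
def apply_hyphenmin(word, candidates, lefthyphenmin=2, righthyphenmin=3):
    """Keep only the break points allowed by the hyphenmin distances.

    A candidate p means "a hyphen may be placed between word[:p] and
    word[p:]". The names and defaults are illustrative assumptions that
    mirror LEFTHYPHENMIN 2 and RIGHTHYPHENMIN 3 from hyph_en_US.dic.
    """
    n = len(word)
    return [
        p
        for p in sorted(candidates)
        if p >= lefthyphenmin and n - p >= righthyphenmin
    ]


# Hypothetical candidates for "anomaly": 4 is anom-aly, 5 is anoma-ly.
candidates = [4, 5]
allowed = apply_hyphenmin("anomaly", candidates)
too_loose = apply_hyphenmin("anomaly", candidates, righthyphenmin=2)
```

With the right distance set to 3, only the anom-aly break remains; with
the old value of 2, the bad anoma-ly break would still be accepted.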

On the practical usage of the new extension, see README.compound in
the source distribution. More documentation and development tools for
the extended hyphenation patterns are planned. It is suggested that
the (future) hyphenation dictionary developers of the related
languages collect all common non-compound words and mark compound word
boundaries in their hyphenation dictionaries (the source of the
hyphenation patterns).
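
The idea of hyphenating with two pattern sets can be sketched in Python
as follows. This is only a toy illustration under stated assumptions: the
function names, the two tiny pattern lists and the single (non-recursive)
splitting pass are invented for the example, and the real dictionary
format is the one described in README.compound.

```python
import re


def parse_pattern(pat):
    """Split a Liang-style pattern such as 'tic1hand' into letters and levels."""
    letters = re.sub(r"\d", "", pat)
    levels = [0] * (len(letters) + 1)
    i = 0
    for ch in pat:
        if ch.isdigit():
            levels[i] = int(ch)  # level of the gap before letters[i]
        else:
            i += 1
    return letters, levels


def break_points(word, patterns, left=2, right=2):
    """Return positions p where a break between word[:p] and word[p:] is allowed."""
    padded = "." + word + "."
    levels = [0] * (len(padded) + 1)
    for pat in patterns:
        letters, plv = parse_pattern(pat)
        start = padded.find(letters)
        while start != -1:
            for k, lv in enumerate(plv):
                levels[start + k] = max(levels[start + k], lv)
            start = padded.find(letters, start + 1)
    word_levels = levels[1:-1]  # word_levels[p] is the gap before word[p]
    # Odd levels allow a break, even levels forbid it.
    return [p for p in range(left, len(word) - right + 1) if word_levels[p] % 2 == 1]


def hyphenate_compound(word, compound_patterns, word_patterns,
                       left=2, right=2, cleft=2, cright=2):
    """Split at compound boundaries first, then hyphenate each part on its own."""
    boundaries = break_points(word, compound_patterns, left, right)
    breaks, start = [], 0
    for end in boundaries + [len(word)]:
        lmin = left if start == 0 else cleft        # distance from the left boundary
        rmin = right if end == len(word) else cright  # distance from the right boundary
        part = word[start:end]
        breaks += [start + p for p in break_points(part, word_patterns, lmin, rmin)]
        if end < len(word):
            breaks.append(end)
        start = end
    return breaks


def show(word, breaks):
    """Render the word with hyphens at the given break positions."""
    pieces, prev = [], 0
    for p in breaks:
        pieces.append(word[prev:p])
        prev = p
    pieces.append(word[prev:])
    return "-".join(pieces)


# Hypothetical toy patterns: '1ch' breaks before "ch"; 'tic1hand' marks a boundary.
plain = show("tichand", break_points("tichand", ["1ch"]))
compound = show("tichand", hyphenate_compound("tichand", ["tic1hand"], ["1ch"]))
```

With the ordinary patterns alone, the toy rule gives the bad break
ti-chand. When the compound boundary is found first, the parts tic and
hand are hyphenated separately, so only tic-hand remains.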

FSF.hu Foundation, Hungary (http://www.fsf.hu) was the main supporter
of the work.

Regards,
László Németh

2008/3/6 Németh László <[EMAIL PROTECTED]>:
> Dear Ruud and all,
>
>  2008/3/4, Ruud Baars <[EMAIL PROTECTED]>:
>
> > László, could you help with the following questions:
>  >
>  >  1) What is the best moment to take into account that a word cannot
>  >  be hyphenated leaving 1 char alone at the start or end? This is (in Dutch) even
>  >  true in compounds, where one char of a part cannot be left alone!
>  >  So far, I have taken this into account while making the TeX patterns using
>  >  patgen.
>
>  For now, the best method for languages with open compounding is to collect
>  millions of real compound words (e.g. from web pages), analyze them
>  with Hunspell and its upcoming -m (analyze) option, and make a huge
>  hyphenated dictionary for pattern generation.
>
>
>  >  2) There is often a hyphenation conflict in the compounding. E.g.: a ch is
>  >  never split in Dutch (it is like a g), except in (rare) compounds like
>  >  tic+hand. Patgen treats this by creating rules and exceptions. This
>  >  generates a rather large pattern file. Did anyone ever try using full
>  >  (uncompounded) words as (perfect) patterns? Would that be feasible (it
>  >  is easier to maintain at least ...)
>  >  I also have a (very slow) PHP program that generates only perfect
>  >  patterns, without exceptions. Is that a path that might be feasible?
>
>  Full words in hyphenation patterns generate too much data after
>  substrings.pl conversion (for example, half a million patterns instead
>  of 100 thousand in a real example). I have also written a perfect
>  pattern generator in Perl to solve this problem with size
>  optimization. Full words are needed only for the learning and test corpus.
>
>
>  >  3) What is the way to explicitly code compound boundaries ? I saw
>  >  something like .. ? How does the (un)compounding work in hyphenation ?
>
>  Decomposition is supported in hyphenation through the learning data, so the
>  resulting patterns will hyphenate only this data perfectly. I plan to
>  use Hunspell for decomposition, but it is also not perfect for all
>  possible compounds. I will test the following lightweight "compound
>  hyphenation level" patch. Hyphenation dictionary development will
>  consist of two phases: the generation of the compound and the
>  non-compound patterns, and the integration of these patterns. Some of the
>  hyphenation levels hyphenate only at compound boundaries ("compound
>  hyphenation levels"), for example level 5 and level 7:
>
>  COMPOUNDLEVEL 57
>  .tic5han
>  ...
>
>  The hyphenator will break the hyphenated words at compound break points
>  and rehyphenate the parts. For example, a word with the ti3c5hand
>  hyphenation is hyphenated as hyphenate(tic) and hyphenate(hand), so the
>  bad break point (ti-chand) will be eliminated. Advantages of this method
>  are the better compound decomposition, the optional hyphenation break
>  distance from compound breaks (it might be a hyphenation option in
>  OpenOffice.org, too), and maybe the limited perfect pattern generation
>  (only for the compound breaks).
>
>
>  >  4) More hunspell-like: does the uncompounding also support additional
>  >  characters? In Dutch (and German) koningshuis uncompounds to
>  >  koning+s+huis. (konings is not a word) Can the uncompounding support this?
>  >  Does uncompounding in some way relate to compounding rules in hunspell?
>
>  I think that, with the suggested compound hyphenation level feature, the
>  hyphenator will handle this morpheme better, because the
>  hyphenation will be based more on compound breaks. In your example,
>  you will be able to use the konings|huis decomposition for the compound
>  hyphenation level (if needed, also adding the non-word "konings" to
>  your hyphenation dictionary).
>
>
>  >  You see, I am trying to get a picture of the entire process, to make the
>  >  hyphenation as perfect as it can be. I think it is better to not hyphenate
>  >  than to hyphenate wrongly.
>
>  I believe the aim of hyphenation is perfect typesetting, not
>  perfect orthography, so it is better to hyphenate (especially the long
>  compound words) than not. Fortunately, the pattern-based hyphenation of
>  TeX/OpenOffice.org supports these and other extraordinary (for
>  example, mistyped) cases.
>
>
>  >  Compounding is an important issue in this (valk-uil and val-kuil are both
>  >  valid compounds, and there is no way to decide which is correct without
>  >  doing content analysis.)
>
>  Ambiguous compound hyphenation (valk|uil, val|kuil or the Hungarian
>  leg|elő-re, le-ge-lő-re) can be forbidden on a compound hyphenation
>  level, for example:
>
>  COMPOUNDLEVEL 567
>  .val6k6uil.
>  .leg6előre.
>
>
>  >  Hope you can help sketch me the big picture.
>  >
>  >  Ruud
>
>  I hope so, too. :)
>  I have also posted this letter to lang-dev.
>
>  Best regards,
>  László
>
