pietsch     2003/07/20 15:27:13

  Modified:    src/documentation/content/xdocs hyphenation.xml
  Log:
  Added comments about pattern file structure and conversion
  of TeX patterns
  
  Revision  Changes    Path
  1.6       +146 -36   xml-fop/src/documentation/content/xdocs/hyphenation.xml
  
  Index: hyphenation.xml
  ===================================================================
  RCS file: /home/cvs/xml-fop/src/documentation/content/xdocs/hyphenation.xml,v
  retrieving revision 1.5
  retrieving revision 1.6
  diff -u -r1.5 -r1.6
  --- hyphenation.xml   15 Jul 2003 18:16:30 -0000      1.5
  +++ hyphenation.xml   20 Jul 2003 22:27:13 -0000      1.6
  @@ -62,40 +62,79 @@
       <title>Custom Hyphenation Support</title>
       <section id="custom-intro">
         <title>Introduction</title>
  -      <p>FOP uses an XML-based TeX-like hyphenation pattern scheme.
  -However, because of <link href="#license-issues">licensing issues</link>, there are currently some significant holes in FOP's hyphenation support.
  -The information in this section is intended to help you work around these limitations, if possible, add support for other languages, or enhance FOP's support of current languages.</p>
  -      <note>If you have access to hyphenation patterns that are licensed in an Apache-compatible way, or if you have made improvements to an existing FOP hyphenation pattern, or if you have created one from scratch, please consider contributing these to FOP so that they can benefit other FOP users as well. Please inquire on the <link href="maillist.html#fop-user">FOP User mailing list</link>.</note>
  +      <p>FOP uses Liang's hyphenation algorithm, well known from TeX. It needs
  +       language-specific patterns and other data to operate.</p>
  +      <p>Because of <link href="#license-issues">licensing issues</link>,
  +       there are currently some significant holes in FOP's hyphenation support.
  +       The information in this section is intended to help you work around these
  +       limitations, if possible, add support for other languages, or enhance FOP's
  +       support of current languages.</p>
  +      <note>If you have access to hyphenation patterns that are licensed in an
  +       Apache-compatible way, or if you have made improvements to an existing FOP
  +       hyphenation pattern, or if you have created one from scratch, please
  +       consider contributing these to FOP so that they can benefit other FOP users
  +       as well. Please inquire on the <link href="maillist.html#fop-user">FOP User
  +       mailing list</link>.</note>
       </section>
       <section id="custom-license-issues">
         <title>License Issues</title>
  -      <p>Many of the hyphenation files distributed with TeX and its offspring are licenced under the <fork href="http://www.latex-project.org/lppl.html">LaTeX Project Public License (LPPL)</fork>, which prevents them from being distributed with Apache software.
  -Although Apache FOP cannot redistribute hyphenation pattern files that do not conform with its license scheme, that does not necessarily prevent users from using such hyphenation patterns with FOP.
  -However, it does place on the user the responsibility for determining whether the user can rightly use such hyphenation patterns under the hyphenation pattern license.</p>
  -      <warning>The user is responsible to settle license issues for hyphenation pattern files that are obtained from non-Apache sources.</warning>
  +      <p>Many of the hyphenation files distributed with TeX and its offspring are
  +       licensed under the <fork href="http://www.latex-project.org/lppl.html">LaTeX
  +       Project Public License (LPPL)</fork>, which prevents them from being
  +       distributed with Apache software. The LPPL puts restrictions on file names
  +       in redistributed derived works which we feel we can't guarantee. Some
  +       hyphenation pattern files have other or additional restrictions, for
  +       example against use for commercial purposes.</p>
  +      <p>Although Apache FOP cannot redistribute hyphenation pattern files that do
  +       not conform with its license scheme, that does not necessarily prevent users
  +       from using such hyphenation patterns with FOP. However, it does place on
  +       the user the responsibility for determining whether the user can rightly use
  +       such hyphenation patterns under the hyphenation pattern license.</p>
  +      <warning>The user is responsible for settling license issues for hyphenation
  +       pattern files that are obtained from non-Apache sources.</warning>
       </section>
       <section id="custom-sources">
         <title>Sources of Custom Hyphenation Pattern Files</title>
  -      <p>The most important source of hyphenation pattern files is the <fork href="http://www.ctan.org/tex-archive/language/hyphenation/">CTAN TeX Archive</fork>.</p>
  +      <p>The most important source of hyphenation pattern files is the
  +       <fork href="http://www.ctan.org/tex-archive/language/hyphenation/">CTAN TeX
  +        Archive</fork>.</p>
       </section>
       <section id="custom-install">
         <title>Installing Custom Hyphenation Patterns</title>
         <p>To install a custom hyphenation pattern for use with FOP:</p>
         <ol>
  -        <li>Convert the TeX hyphenation pattern file to the FOP format. The FOP format is an xml file conforming to the DTD found at <code>{fop-dir}/src/hyph/hyphenation.dtd</code>.</li>
  -        <li>Name this new file following this schema: <code>languageCode_countryCode.xml</code>.
  -The country code is optional, and should be used only if needed. For example:
  +        <li>Convert the TeX hyphenation pattern file to the FOP format. The FOP
  +         format is an xml file conforming to the DTD found at
  +         <code>{fop-dir}/src/hyph/hyphenation.dtd</code>.</li>
  +        <li>Name this new file following this schema:
  +         <code>languageCode_countryCode.xml</code>. The country code is
  +          optional, and should be used only if needed. For example:
             <ul>
  -            <li><code>en_US.xml</code> would be the file name for an American English hyphenation pattern.</li>
  -            <li><code>it.xml</code> would be the file name for an Italian hyphenation pattern.</li>
  +            <li><code>en_US.xml</code> would be the file name for American
  +             English hyphenation patterns.</li>
  +            <li><code>it.xml</code> would be the file name for Italian
  +             hyphenation patterns.</li>
             </ul>
  -The language and country codes must match the XSL-FO input, which follows <link href="http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt">ISO 639</link> (languages) and
  -<link href="http://www.ics.uci.edu/pub/ietf/http/related/iso3166.txt">ISO 3166</link> (countries).
  -NOTE: The ISO 639/ISO 3166 convention is that language names are written in lower case, while country codes are written in upper case.</li>
  -        <li>There are two ways to make the FOP-compatible hyphenation pattern file accessible to FOP:
  +          The language and country codes must match the XSL-FO input, which
  +          follows <link href="http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt">ISO
  +          639</link> (languages) and <link href="http://www.ics.uci.edu/pub/ietf/http/related/iso3166.txt">ISO
  +          3166</link> (countries). NOTE: The ISO 639/ISO 3166 convention is that
  +          language names are written in lower case, while country codes are written
  +          in upper case. FOP does not check whether the language and country specified
  +          in the FO source are actually from the current standard, but it relies
  +          on them being two-letter strings in a few places. So you can make up your
  +          own codes for custom hyphenation patterns, but they should be two-letter
  +          strings too (patches for proper handling of extensions are welcome).</li>
  +        <li>There are two ways to make the FOP-compatible hyphenation pattern file
  +         accessible to FOP:
             <ul>
  -            <li>Place the FOP-compatible hyphenation pattern file into the directory {fop-dir}/src/hyph and rebuild FOP. The file will be picked up and added to fop.jar.</li>
  -            <li>Put the file into a directory of your choice and configure FOP to look for custom patterns in this directory, by setting the <link href="configuration.html#hyphenation-dir">&lt;hyphenation-dir&gt; configuration option</link>.</li>
  +            <li>Place the FOP-compatible hyphenation pattern file into the
  +             directory {fop-dir}/src/hyph and rebuild FOP. The file will be picked
  +             up and added to fop.jar.</li>
  +            <li>Put the file into a directory of your choice and configure FOP to
  +             look for custom patterns in this directory, by setting the
  +             <link href="configuration.html#hyphenation-dir">&lt;hyphenation-dir&gt;
  +              configuration option</link>.</li>
             </ul>
           </li>
         </ol>
  @@ -103,28 +142,74 @@
     </section>
     <section id="patterns">
       <title>Hyphenation Patterns</title>
  -    <p>If you would like to build your own hyphenation pattern files, or modify existing ones, this section will help you understand how to do so. Even when creating a pattern file from scratch, it may be beneficial to start with an existing file and modify it. See <link href="#std">Standard Hyphenation Support</link> or the source distribution (src/hyph) for examples. Here is a brief explanation of the contents of FOP's hyphenation patterns:</p>
  -    <warning>The remaining content of this section should be considered "draft" quality. It was drafted from theoretical literature, and has not been tested against actual FOP behavior. It may contain errors or omissions. Do not rely on these instructions without testing everything stated here. If you use these instructions, please provide feedback on the <link href="maillist.html#fop-user">FOP User mailing list</link>, either confirming their accuracy, or raising specific problems that we can address.</warning>
  +    <p>If you would like to build your own hyphenation pattern files, or modify
  +     existing ones, this section will help you understand how to do so. Even
  +     when creating a pattern file from scratch, it may be beneficial to start
  +     with an existing file and modify it. See <link href="#std">Standard
  +     Hyphenation Support</link> or the source distribution (src/hyph) for
  +     examples. Here is a brief explanation of the contents of FOP's hyphenation
  +     patterns:</p>
  +    <warning>The remaining content of this section should be considered "draft"
  +     quality. It was drafted from theoretical literature, and has not been
  +     tested against actual FOP behavior. It may contain errors or omissions.
  +     Do not rely on these instructions without testing everything stated here.
  +     If you use these instructions, please provide feedback on the
  +     <link href="maillist.html#fop-user">FOP User mailing list</link>, either
  +     confirming their accuracy, or raising specific problems that we can
  +     address.</warning>
       <ul>
         <li>The root of the pattern file is the &lt;hyphenation-info> element.</li>
  -      <li>&lt;hyphen-char> is self-explanatory: its attribute "value" contains the default character to be used for hyphenating this language. For English, this is the hyphen "-".</li>
  +      <li>&lt;hyphen-char>: its attribute "value" contains the character signalling
  +       a hyphen in the &lt;exceptions> section. It has nothing to do with the
  +       hyphenation character used in FOP; use the XSL-FO hyphenation-character
  +       property to define the hyphenation character there. At some points
  +       a dash U+002D is hardwired in the code, so you'd better use this too
  +       (patches to rectify the situation are welcome). There is no default:
  +       if you declare exceptions with hyphenations, you must declare the
  +       hyphen-char too.</li>
         <li>&lt;hyphen-min> contains two attributes:
           <ul>
  -          <li>before: the minimum number of characters in a word allowed to exist on a line immediately preceding a hyphenated word-break.</li>
  -          <li>after: the minimum number of characters in a word allowed to exist on a line immediately after a hyphenated word-break.</li>
  +          <li>before: the minimum number of characters in a word allowed to exist
  +           on a line immediately preceding a hyphenated word-break.</li>
  +          <li>after: the minimum number of characters in a word allowed to exist
  +           on a line immediately after a hyphenated word-break.</li>
           </ul>
  +        This element is unused and not even read. It should be considered
  +        documentation of the parameters used during pattern generation.
         </li>
  -      <li>&lt;classes> contains whitespace-separated character sets.
  -The members of each set should be treated as equivalent for purposes of hyphenation.
  -The English patterns, for example, include sets such as "aA" and "bB" to indicate that lower case characters should be treated as equivalent to uppercase characters for purposes of computing potential hyphenation breaks.</li>
  -      <li>&lt;exceptions> contains whitespace-separated words, each of which has either explicit hyphen characters to denote acceptable breakage points, or no hyphen characters, to indicate that this word should never be hyphenated.
  -Exceptions override the patterns described below.</li>
  -      <li>&lt;patterns> includes whitespace-separated patterns, which are what drive most hyphenation decisions.
  -The characters in these patterns are explained as follows:
  +      <li>&lt;classes> contains whitespace-separated character sets. The members
  +       of each set should be treated as equivalent for purposes of hyphenation,
  +       usually upper and lower case of the same character. The first character
  +       of the set is the canonical character; the patterns and exceptions
  +       should only contain these canonical characters (except, of course,
  +       digits for weight, the period (.) as word delimiter in the patterns and
  +       the hyphen char in exceptions).</li>
  +      <li>&lt;exceptions> contains whitespace-separated words, each of which
  +       has either explicit hyphen characters to denote acceptable breakage
  +       points, or no hyphen characters, to indicate that this word should
  +       never be hyphenated, or explicit &lt;hyp> elements for specifying
  +       changes of spelling due to hyphenation (like backen -> bak-ken or
  +       Stoffarbe -> Stoff-farbe in the old German spelling). Exceptions override
  +       the patterns described below. Explicit &lt;hyp> declarations don't work
  +       yet (patches welcome). Exceptions are generally a bit brittle, so test
  +       carefully.</li>
  +      <li>&lt;patterns> includes whitespace-separated patterns, which are what
  +       drive most hyphenation decisions. The characters in these patterns are
  +       explained as follows:
           <ul>
  -          <li>non-numeric characters represent characters in a sub-word to be evaluated</li>
  -          <li>the period character (.) represents a word boundary, i.e. either the beginning or ending of a word</li>
  -          <li>numeric characters represent a scoring system for indicating the acceptability of a hyphen in this location. Only odd numbers represent an acceptable location for a hyphen, with 5 being most desirable, and 1 being least desirable. Even numbers indicate an unacceptable location, with zero (implied when there is no number present) being unacceptable, and 4 being extremely unacceptable.</li>
  +          <li>non-numeric characters represent characters in a sub-word to be
  +           evaluated</li>
  +          <li>the period character (.) represents a word boundary, i.e. either
  +           the beginning or ending of a word</li>
  +          <li>numeric characters represent a scoring system for indicating the
  +           acceptability of a hyphen in this location. Odd numbers represent an
  +           acceptable location for a hyphen, with higher values overriding lower
  +           inhibiting values. Even numbers indicate an unacceptable location, with
  +           higher values overriding lower values indicating an acceptable position.
  +           A value of zero (inhibiting) is implied when there is no number present.
  +           Generally patterns are constructed so that values greater than 4 are rare.
  +           Due to a bug, patterns with values of 8 and greater currently have no
  +           effect, so don't be surprised if they are ignored.</li>
           </ul>
           Here are some examples from the English patterns file:
           <ul>
  @@ -134,7 +219,32 @@
           Note that the algorithm that uses this data searches for each of the word's substrings in the patterns, and chooses the <em>highest</em> value found for letter combination.
         </li>
       </ul>
  -    <note>An open-source utility called patgen is available on many Unix/Linux distributions to assist in creating pattern files from dictionaries. Consult man pages or the www for details.</note>
  +    <p>If you want to convert a TeX hyphenation pattern file, you have to undo
  +     the TeX encoding for non-ASCII text. FOP uses Unicode, and the patterns
  +     must be proper Unicode too. You should be aware of XML encoding issues;
  +     preferably use a good Unicode editor.</p>
  +    <p>Note that FOP does not do Unicode character normalization. If you use
  +     combining characters for accents and other character decorations, you must
  +     declare character classes for them, and use the same sequence of base character
  +     and combining marks in the XSL-FO source, otherwise the patterns won't match.
  +     Fortunately, Unicode provides precomposed characters for all important cases
  +     in common languages, and so far nobody has seriously run into this issue. Some dead
  +     languages and dialects, especially ancient ones, may pose a real problem
  +     though.</p>
  +    <p>If you want to generate your own patterns, an open-source utility called
  +     patgen, available on many Unix/Linux distributions and in every TeX
  +     distribution, can be used to assist in
  +     creating pattern files from dictionaries. Pattern creation for languages like
  +     English or German is an art. If you can, read Frank Liang's original paper
  +     "Word Hy-phen-a-tion by Com-pu-ter" (yes, with hyphens). It is not available
  +     online. The original patgen.web source, included in the TeX source distributions,
  +     contains valuable comments; unfortunately, technical details often obscure the
  +     high-level issues. Another important source is
  +     <fork href="http://www.ctan.org/tex-archive/systems/knuth/tex/texbook.tex">The
  +     TeXbook</fork>, appendix H (either read the TeX source, or run it through
  +     TeX to typeset it). Secondary articles, for example the works by Petr Sojka,
  +     may also give some much-needed insight into problems arising in automated
  +     hyphenation.</p>
     </section>
     </body>
   </document>
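
Illustration (not part of the commit): the pattern scoring described in the
diff above can be sketched in a few lines of Python. This is a minimal sketch
of Liang-style matching, not FOP's implementation. The function names, the
left_min/right_min parameters and the small pattern list (written in the style
of the English TeX patterns) are assumptions made for this example only.

```python
def parse_pattern(pattern):
    """Split a pattern such as 'hy3ph' into its letters and inter-letter weights.

    weights[i] is the value of the slot in front of letters[i]; a missing
    digit means an implied weight of zero.
    """
    letters = []
    weights = [0]
    for ch in pattern:
        if ch.isdigit():
            weights[-1] = int(ch)
        else:
            letters.append(ch)
            weights.append(0)
    return "".join(letters), weights


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Return the pieces of `word` between the permitted break points.

    left_min and right_min are hypothetical limits for this sketch; they are
    not taken from the <hyphen-min> element.
    """
    table = dict(parse_pattern(p) for p in patterns)
    padded = "." + word.lower() + "."      # '.' marks the word boundaries
    scores = [0] * (len(padded) + 1)       # scores[i]: slot before padded[i]

    # Look up every substring and keep the highest weight seen for each slot.
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            weights = table.get(padded[start:end])
            if weights:
                for offset, weight in enumerate(weights):
                    slot = start + offset
                    scores[slot] = max(scores[slot], weight)

    # An odd final weight permits a break, an even one (including 0) forbids it.
    pieces = []
    last = 0
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:         # slot in front of word[k]
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return pieces


# Illustrative pattern subset, assumed for this example.
PATTERNS = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]

print("-".join(hyphenate("hyphenation", PATTERNS)))
```

With this pattern list, the weight 2 from "he2n" overrides the 1 from "1na",
so no break is allowed before the "n", while the 5 from "hen5at" overrides the
2 from "n2at" and permits the break before "ation".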
  
  
  
