>> the basic ("knuthian") tex hyphenation algorithm does not handle
>> any words with diacritics, and that is what the us list is based
>> on.

In general, this is not a restriction since up to 256 characters are
allowed in `patgen', which is the ultimate program to generate
hyphenation patterns.  Non-English hyphenation patterns simply use
precomposed characters with diacritics; for example, the German
patterns now use the latin-9 character set.  The English patterns
could do exactly the same to allow stuff like `chef d'œuvre'
(assuming that this word could be hyphenated, which is probably not
true :-).

The real issue is rather that *users* are not accustomed to selecting
an input and/or font encoding while typesetting US English texts.
The only chance to improve that IMHO is to use TeX systems that
natively use UTF-8.  So groff has a slight advantage here over plain
TeX since it is set up by default to use latin-1.

Note, however, that no one takes care of the US patterns.  The most
recent version used in the `tex-hyphen' project at

  https://github.com/hyphenation/tex-hyphen

is from 1990!  In other words, the only `standardized' corrective is
Barbara's list...

> I see.  Werner (or anyone else familiar with the groff side of
> things), is this limitation also present in groff?  Or could groff's
> version of tmac/hyphenex.us be put into Latin-9 encoding to
> accommodate these words?

It could.  However, for the sake of maintainability, I strongly
suggest that `hyphenex.us' stays in sync with the original one edited
by Barbara.  You can always add new entries with the `.hw' request
(provided your setup correctly understands the corresponding
encoding; have a look at how German is handled, for example).

>> i'm surprised that the encoding is (still?) listed as latin-* --
>> there has been an effort to support utf8, so i (perhaps rashly)
>> assumed that would be the base encoding.

groff cannot digest UTF-8 natively.
However, there are means to automatically map UTF-8 to its internal
representation, which usually is latin-1, together with constructs
like \[uXXXX] to access Unicode-encoded characters outside the
selected encoding.

> http://git.savannah.gnu.org/gitweb/?p=groff.git;a=history;f=tmac/hyphenex.det;h=c74eebabff8e35353fdfb176a5c98df56c3e4ea0;hb=HEAD

`hyphenex.det' is no longer maintained – and now deleted from the
repository: I took the opportunity to completely update the German
hyphenation patterns, and this file is no longer needed.

> Their encodings on the TeX side may have been updated, and the
> changes never pulled to groff.

Today, almost all hyphenation patterns in the `tex-hyphen' repository
(and thus in the distribution from CTAN) are in UTF-8 encoding.

> In contrast (and probably because of this thread), groff's
> tmac/hyphenex.us was updated from TeX four days ago:

Exactly.

> This file does not specify any encoding, but its entire contents
> fall into 7-bit ASCII.

Well, the list simply doesn't contain any non-ASCII words...


    Werner

_______________________________________________
bug-groff mailing list
[email protected]
https://lists.gnu.org/mailman/listinfo/bug-groff
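
To illustrate the UTF-8 mapping mentioned above, here is a minimal
Python sketch.  It is only an illustration, not groff's actual
implementation: the function name `to_groff_input' is hypothetical,
and treating every code point up to U+00FF as directly representable
in latin-1 is a simplifying assumption.  Characters inside that range
are passed through unchanged; everything else is written as a
\[uXXXX] escape.

```python
def to_groff_input(text: str) -> str:
    # Hypothetical helper, for illustration only.
    # Assumption: code points <= U+00FF are kept as-is (latin-1 range);
    # all others are emitted as a \[uXXXX] escape with uppercase hex
    # digits, zero-padded to at least four digits.
    pieces = []
    for ch in text:
        code_point = ord(ch)
        if code_point <= 0xFF:
            pieces.append(ch)
        else:
            pieces.append("\\[u%04X]" % code_point)
    return "".join(pieces)


if __name__ == "__main__":
    # 'œ' is U+0153, which lies outside latin-1.
    print(to_groff_input("chef d'œuvre"))  # chef d'\[u0153]uvre
```

With this sketch, a German word such as `Käse' comes out unchanged
because `ä' (U+00E4) is within latin-1, while `œ' has to be written
as an escape.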
