Bug#456360: texlive-lang: Inaccurate package names

Jordà Polo Mon, 04 Feb 2008 10:12:29 -0800

On Sat, Dec 15, 2007 at 11:16:07AM +0100, Jordà Polo wrote:
> OK. I have a few ideas, but I'm pretty busy at the moment. I'll think about
> it and try to come up with a consistent proposal. I'm not sure I'll succeed,
> but at least I'll try. Whatever the outcome, I'll surely reply to this bug
> in 2 months.


I finally had some time to take a look at this issue. At first I thought
it would be a matter of moving languages to the right group/family. But
it is not that easy since some collections are based on packages that
include resources for more than one language.

Frank Küster mentioned sizes, so let's take a look at the numbers. This
is the list of texlive-lang-* packages, sorted by Installed-Size:

   164 texlive-lang-italian 
   168 texlive-lang-latin 
   172 texlive-lang-danish 
   176 texlive-lang-manju 
   204 texlive-lang-finnish 
   204 texlive-lang-spanish 
   216 texlive-lang-ukenglish 
   248 texlive-lang-dutch 
   304 texlive-lang-portuguese 
   336 texlive-lang-swedish 
   356 texlive-lang-norwegian 
   356 texlive-lang-other 
   504 texlive-lang-hungarian 
   540 texlive-lang-hebrew 
   812 texlive-lang-croatian 
  1096 texlive-lang-german 
  1752 texlive-lang-armenian 
  1820 texlive-lang-french 
  2240 texlive-lang-tibetan 
  3960 texlive-lang-czechslovak 
  5828 texlive-lang-mongolian 
 10080 texlive-lang-arab 
 10092 texlive-lang-polish 
 10164 texlive-lang-african 
 12212 texlive-lang-indic 
 12800 texlive-lang-vietnamese 
 14996 texlive-lang-cyrillic 
 16456 texlive-lang-greek 
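
For anyone who wants to reproduce a list like this, here is a minimal sketch in Python. It assumes a Debian system where the packages are known to dpkg (dpkg-query only reports packages in the local status database), and the function names are made up for illustration. Installed-Size is an estimate in KiB.

```python
import subprocess


def parse_sizes(text):
    """Parse lines of the form '<Installed-Size> <package>' into sorted (size, name) pairs."""
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and entries for which dpkg reports no size.
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        pairs.append((int(fields[0]), fields[1]))
    # Sort by size first, then by package name for equal sizes.
    return sorted(pairs)


def installed_lang_packages():
    """Ask dpkg-query for all texlive-lang-* packages (hypothetical helper)."""
    result = subprocess.run(
        ["dpkg-query", "-W",
         r"--showformat=${Installed-Size} ${Package}\n",
         "texlive-lang-*"],
        capture_output=True, text=True, check=True,
    )
    return parse_sizes(result.stdout)
```

Calling installed_lang_packages() and printing each pair should give output in the same shape as the list above, though the numbers will depend on the package versions installed.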

And listed below is a summary with random comments about the most
interesting packages (packages not listed below are "simple" packages
that only include hyphen files or fonts for a single language). There
are also a few comments in parentheses that aren't really relevant to
the discussion, but may be of interest for the maintainers. Also, note
that in the following lines "package" actually refers to CTAN packages,
not Debian packages.

* texlive-lang-manju: Basically includes manjutex, a package that offers
  support for Manju, a language with very few speakers and a writing
  system derived from the Mongolian script[1]. (Btw, the documentation,
  written in April 2001, says: «This package is founded on MonTeX and
  will finally merge with MonTeX in order to provide all Mongolian
  writings.» The MonTeX documentation dates from 2002/07/01. You can
  also read the following at ctan.org[2]: «This catalogue entry
  describes the ‘original’ ManjuTeX; its functionality has now been
  subsumed into monTeX, though the obsolete ManjuTeX remains on the
  archive.» Does it make sense to include such obsolete packages in
  Debian?)

  1. http://en.wikipedia.org/wiki/Manchu_language
  2. http://www.ctan.org/tex-archive/help/Catalogue/entries/manjutex.html

* texlive-lang-spanish: Hyphenation files for Catalan, Spanish and
  Galician. It is interesting that Catalan and Galician are the only
  languages that don't have their own texmf/tpm/hyphen-*.tpm file. Both
  are included in texmf/tpm/hyphen-spanish.tpm, which is probably what
  led to the wrong description in the Debian package.

* texlive-lang-other: Hyphenation files for Coptic, Esperanto, Estonian,
  Icelandic, Indonesian, Interlingua, Romanian, Serbian, Slovene,
  Turkish, Sorbian and Welsh (title and description in
  texmf/tpm/hyphen-welsh.tpm are wrong btw, Welsh != Czechoslovak).

* texlive-lang-german: Basically German resources. (It also includes
  umlaute, which is obsolete according to the documentation: «This
  package is obsolete! This package was superseeded by the inputenc
  package which is included in any LaTeX 2ε system since December 1994.
  Therefore this package is no longer supported; so please don’t use
  umlaute, just use inputenc instead.» In Debian, there is also ginpenc,
  which is included in texlive-latex-extra.)

* texlive-lang-french: Mostly French related packages, but it also
  includes the Basque hyphenation files.

* texlive-lang-czechslovak: Based on packages that include resources for
  both Czech and Slovak, so it probably makes sense as it is unless
  someone wants to split them. The description and title in
  hyphen-czechslovak.tpm are wrong though: that file doesn't include
  «Fonts for typesetting some Czechslovak scripts» but «Czech and Slovak
  hyphenation files» (the word czechslovak doesn't exist AFAIK).

* texlive-lang-mongolian: Includes hyphenation files and support for
  writing Mongolian languages in various scripts, but it also supports
  Manju (montex package). It also includes another package for Soyombo,
  an ancient script.

* texlive-lang-arab: Includes arabi and arabtex. The former is used to
  write Arabic and Farsi, while the latter is focused on Arabic but also
  provides limited support for other languages written in the Arabic
  alphabet: Farsi, Dari, Urdu, Pashto, Maghribi. (Would it be a good
  idea to rename this collection to arabic, which is the name of the
  script?)

* texlive-lang-african: On the one hand, it includes fonts for many
  African scripts (see doc/fonts/fc/fc.rme:71 for a full list). On the
  other hand, it includes fonts for the Ethiopic alphabet. The Ethiopic
  fonts (ethiop, ethiop-t1) are approximately twice as large as the
  other African fonts (fc). Perhaps it would be a good idea to split
  this package.

* texlive-lang-indic: This package is part of texlive-bin, not
  texlive-lang.

* texlive-lang-cyrillic: Hyphenation files for Bulgarian, Russian and
  Ukrainian; fonts and support for Cyrillic languages; document classes
  to make documents in accordance with Russian standards, etc.

* texlive-lang-greek: Includes both classical and modern Greek. (It is
  rather large, so, would it make sense to split ancient/classical Greek
  from modern Greek?)


After reviewing the collections I can understand some of them, especially
the larger ones. But I still can't understand why, among the tiny
language packs, some are individual and some are not.

What makes a language worth its own collection? How is it that Italian
or Manju have a collection while Romanian or Galician don't? It isn't a
matter of number of speakers (with ~60 speakers you can hardly beat
Manju), nor is it a problem of size, since most of these language packs
are similarly small. That's, IMHO, the problem that should be addressed.

If the number of packages and their sizes are that important, then these
factors should be taken into account. One option would be to include
languages smaller than X (1MiB? 0.5? I don't really know where to draw
the line) in "family" collections. So, for example, Estonian, Finnish
and Hungarian would become part of the uralic collection.
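
To make the threshold idea a bit more concrete, here is a rough sketch in Python. The sizes are the Installed-Size figures from the list above (in KiB); the function name and the cut-offs tried are only illustrative, and the sketch does not attempt to decide which family each small collection would end up in.

```python
# Installed-Size in KiB, copied from the list earlier in this message.
SIZES_KIB = {
    "texlive-lang-italian": 164, "texlive-lang-latin": 168,
    "texlive-lang-danish": 172, "texlive-lang-manju": 176,
    "texlive-lang-finnish": 204, "texlive-lang-spanish": 204,
    "texlive-lang-ukenglish": 216, "texlive-lang-dutch": 248,
    "texlive-lang-portuguese": 304, "texlive-lang-swedish": 336,
    "texlive-lang-norwegian": 356, "texlive-lang-other": 356,
    "texlive-lang-hungarian": 504, "texlive-lang-hebrew": 540,
    "texlive-lang-croatian": 812, "texlive-lang-german": 1096,
    "texlive-lang-armenian": 1752, "texlive-lang-french": 1820,
    "texlive-lang-tibetan": 2240, "texlive-lang-czechslovak": 3960,
    "texlive-lang-mongolian": 5828, "texlive-lang-arab": 10080,
    "texlive-lang-polish": 10092, "texlive-lang-african": 10164,
    "texlive-lang-indic": 12212, "texlive-lang-vietnamese": 12800,
    "texlive-lang-cyrillic": 14996, "texlive-lang-greek": 16456,
}


def split_by_threshold(sizes, threshold_kib):
    """Return (small, large): names below the threshold, and names at or above it."""
    small = sorted(name for name, size in sizes.items() if size < threshold_kib)
    large = sorted(name for name, size in sizes.items() if size >= threshold_kib)
    return small, large


# Try the two cut-offs floated above: 1 MiB and 0.5 MiB.
for threshold in (1024, 512):
    small, large = split_by_threshold(SIZES_KIB, threshold)
    print(f"threshold {threshold} KiB: {len(small)} merge candidates, "
          f"{len(large)} stay separate")
```

Everything in the "small" list would be a candidate for a family collection; the larger collections would stay as they are.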

Anyway, this is not a real proposal yet, I just wanted to share my
thoughts since I know more people are following this bug report. Note
that IANAL (I am not a _linguist_), so I probably made some mistakes.
Comments and suggestions are welcome.
