On Sat, Dec 15, 2007 at 11:16:07AM +0100, Jordà Polo wrote:
> OK. I have a few ideas, but I'm pretty busy at the moment. I'll think about
> it and try to come up with a consistent proposal. I'm not sure I'll succeed,
> but at least I'll try. Whatever the outcome, I'll surely reply to this bug
> in 2 months.

I finally had some time to take a look at this issue. At first I thought it
would be a matter of moving languages to the right group/family. But it is
not that easy, since some collections are based on packages that include
resources for more than one language.

Frank Küster mentioned sizes, so let's take a look at the numbers. This is
the list of texlive-lang-* packages, sorted by Installed-Size:

    164 texlive-lang-italian
    168 texlive-lang-latin
    172 texlive-lang-danish
    176 texlive-lang-manju
    204 texlive-lang-finnish
    204 texlive-lang-spanish
    216 texlive-lang-ukenglish
    248 texlive-lang-dutch
    304 texlive-lang-portuguese
    336 texlive-lang-swedish
    356 texlive-lang-norwegian
    356 texlive-lang-other
    504 texlive-lang-hungarian
    540 texlive-lang-hebrew
    812 texlive-lang-croatian
   1096 texlive-lang-german
   1752 texlive-lang-armenian
   1820 texlive-lang-french
   2240 texlive-lang-tibetan
   3960 texlive-lang-czechslovak
   5828 texlive-lang-mongolian
  10080 texlive-lang-arab
  10092 texlive-lang-polish
  10164 texlive-lang-african
  12212 texlive-lang-indic
  12800 texlive-lang-vietnamese
  14996 texlive-lang-cyrillic
  16456 texlive-lang-greek

Listed below is a summary with comments about the most interesting
packages (packages not listed below are "simple" packages that only include
hyphen files or fonts for a single language). There are also a few comments
in parentheses that aren't really relevant to the discussion, but may be of
interest to the maintainers. Also, note that in the following lines
"package" actually refers to CTAN packages, not Debian packages.

* texlive-lang-manju: Basically includes manjutex, a package that offers
  support for Manju, a language with very few speakers and a writing system
  derived from the Mongolian script[1]. (Btw, the documentation, written in
  April 2001, says: «This package is founded on MonTeX and will finally
  merge with MonTeX in order to provide all Mongolian writings.» The MonTeX
  documentation dates from 2002/07/01. You can also read the following at
  ctan.org[2]: «This catalogue entry describes the ‘original’ ManjuTeX; its
  functionality has now been subsumed into monTeX, though the obsolete
  ManjuTeX remains on the archive.» Does it make sense to include such
  obsolete packages in Debian?)

  1. http://en.wikipedia.org/wiki/Manchu_language
  2. http://www.ctan.org/tex-archive/help/Catalogue/entries/manjutex.html

* texlive-lang-spanish: Hyphenation files for Catalan, Spanish and
  Galician. It is interesting that Catalan and Galician are the only
  languages that don't have their own texmf/tpm/hyphen-*.tpm file. Both are
  included in texmf/tpm/hyphen-spanish.tpm, which is probably what led to
  the wrong description in the Debian package.

* texlive-lang-other: Hyphenation files for Coptic, Esperanto, Estonian,
  Icelandic, Indonesian, Interlingua, Romanian, Serbian, Slovene, Turkish,
  Sorbian and Welsh (the title and description in
  texmf/tpm/hyphen-welsh.tpm are wrong btw, Welsh != Czechoslovak).

* texlive-lang-german: Basically German resources. (It also includes
  umlaute, which is obsolete according to the documentation: «This package
  is obsolete! This package was superseeded by the inputenc package which
  is included in any LaTeX 2ε system since December 1994. Therefore this
  package is no longer supported; so please don’t use umlaute, just use
  inputenc instead.» In Debian, there is also ginpenc, which is included in
  texlive-latex-extra.)

* texlive-lang-french: Mostly French-related packages, but it also includes
  the Basque hyphenation files.

* texlive-lang-czechslovak: Based on packages that include resources for
  both Czech and Slovak, so it probably makes sense as it is unless someone
  wants to split them. The description and title in hyphen-czechslovak.tpm
  are wrong though: that file doesn't include «Fonts for typesetting some
  Czechslovak scripts» but «Czech and Slovak hyphenation files» (the word
  czechslovak doesn't exist AFAIK).

* texlive-lang-mongolian: Includes hyphenation files and support for
  writing Mongolian languages in various scripts, but it also supports
  Manju (montex package). It also includes another package for Soyombo, an
  ancient script.

* texlive-lang-arab: Includes arabi and arabtex. The former is used to
  write Arabic and Farsi, while the latter is focused on Arabic but also
  provides limited support for other languages written in the Arabic
  alphabet: Farsi, Dari, Urdu, Pashto, Maghribi. (Would it be a good idea
  to rename this collection to arabic, which is the name of the script?)

* texlive-lang-african: On the one hand, it includes fonts for many African
  scripts (see doc/fonts/fc/fc.rme:71 for a full list). On the other hand,
  it includes fonts for the Ethiopic alphabet. The Ethiopic fonts (ethiop,
  ethiop-t1) are approximately twice as large as the other African fonts
  (fc). Perhaps it would be a good idea to split this package.

* texlive-lang-indic: This package is part of texlive-bin, not
  texlive-lang.

* texlive-lang-cyrillic: Hyphenation files for Bulgarian, Russian and
  Ukrainian; fonts and support for Cyrillic languages; document classes to
  make documents in accordance with Russian standards, etc.

* texlive-lang-greek: Includes both classical and modern Greek. (It is
  rather large, so would it make sense to split ancient/classical Greek
  from modern Greek?)

After reviewing the collections I can understand some of them, especially
the larger ones. But I still can't understand why, among the tiny language
packs, some are individual and some are not. What makes a language worth
its own collection? How is it that Italian or Manju have a collection while
Romanian or Galician don't? It isn't a matter of number of speakers (with
~60 speakers you can hardly beat Manju), nor is it a problem of size, since
most of these language packs are similarly small. That's, IMHO, the problem
that should be addressed.

If the number of packages and their sizes are that important, then these
factors should be taken into account. One option would be to include
languages smaller than X (1MiB? 0.5? I don't really know where to draw the
line) in "family" collections. So, for example, Estonian, Finnish and
Hungarian would become part of the uralic collection.

Anyway, this is not a real proposal yet, I just wanted to share my thoughts
since I know more people are following this bug report. Note that IANAL (I
am not a _linguist_), so I probably made some mistakes. Comments and
suggestions are welcome.
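
To make the "smaller than X goes into a family collection" idea a bit more
concrete, here is a rough sketch in Python. The sizes are the Installed-Size
values from the list above, read as KiB. Everything else is an assumption
made up for illustration: the FAMILY mapping, the plan_collections() helper
and the 1024 KiB default threshold are hypothetical, and the sketch works on
whole collections rather than on individual languages, which glosses over
the collections that already mix several languages.

```python
# Installed-Size per texlive-lang-* collection, as listed in this message
# (read as KiB).
SIZES_KIB = {
    "italian": 164, "latin": 168, "danish": 172, "manju": 176,
    "finnish": 204, "spanish": 204, "ukenglish": 216, "dutch": 248,
    "portuguese": 304, "swedish": 336, "norwegian": 356, "other": 356,
    "hungarian": 504, "hebrew": 540, "croatian": 812, "german": 1096,
    "armenian": 1752, "french": 1820, "tibetan": 2240,
    "czechslovak": 3960, "mongolian": 5828, "arab": 10080,
    "polish": 10092, "african": 10164, "indic": 12212,
    "vietnamese": 12800, "cyrillic": 14996, "greek": 16456,
}

# Hypothetical family assignment, for illustration only (IANAL).
# Collections without an entry here are never merged.
FAMILY = {
    "finnish": "uralic", "hungarian": "uralic",
    "italian": "romance", "latin": "romance", "spanish": "romance",
    "portuguese": "romance", "french": "romance",
    "danish": "germanic", "swedish": "germanic", "norwegian": "germanic",
    "dutch": "germanic", "german": "germanic",
}


def plan_collections(sizes, family, threshold_kib=1024):
    """Split collections into family groups and stand-alone collections.

    A collection is merged into its family only if it is smaller than
    threshold_kib and has a known family; otherwise it is kept as it is.
    """
    merged = {}  # family name -> list of merged collections
    kept = {}    # collection name -> size, left untouched
    for name, size in sizes.items():
        fam = family.get(name)
        if fam is not None and size < threshold_kib:
            merged.setdefault(fam, []).append(name)
        else:
            kept[name] = size
    return merged, kept


if __name__ == "__main__":
    merged, kept = plan_collections(SIZES_KIB, FAMILY)
    for fam, members in sorted(merged.items()):
        total = sum(SIZES_KIB[m] for m in members)
        print(f"{fam}: {', '.join(members)} ({total} KiB)")
    print("kept as is:", ", ".join(sorted(kept)))
```

Changing threshold_kib (512 for the "0.5" variant) is an easy way to see how
many collections each choice of X would affect.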

