Bug#307647: tex4ht: unicode used when it is not needed
On Thursday 02 June 2005 02.16, Eitan Gurari wrote:
> In its default configuration, TeX4ht tries to address a middle road
> with the objective to satisfy large assortment of users, uses, and
> tools. It is difficult to argue a proper way of behavior that will be
> acceptable to all. On the other hand, tex4ht is highly configurable

I respect that your situation is not easy. Probably, high and easy configurability is the only solution.

Best wishes,
Gabor Braun

-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]
Bug#307647: tex4ht: unicode used when it is not needed
> >> tex4ht makes use of unicode letter when this is not needed...

In its default configuration, TeX4ht tries to take a middle road, with the objective of satisfying a large assortment of users, uses, and tools. It is difficult to argue for a proper way of behavior that will be acceptable to all. On the other hand, tex4ht is highly configurable, and variations to the default settings are quite often easy to achieve.

> >3. It *is* possible for you to define an alternate mechanism to avoid
> >   ligatures---create your own htf files which skip the ligatures.

Under the current font schema, introduced half a year ago, it is trivial to adjust tex4ht to ignore ligatures. All it takes is adding the following lines to the unicode.4hf file of the character encoding in use.

  'fi' '' 'fi' ''
  'fl' '' 'fl' ''
  'ff' '' 'ff' ''
  'ffi' '' 'ffi' ''
  'ffl' '' 'ffl' ''

These entries are currently included in the default setting for the iso-8859-1 encoding due to font problems in users' browsers.

-eitan
Bug#307647: tex4ht: unicode used when it is not needed
>> tex4ht makes use of unicode letter when this is not needed. This happens when
>> the latex code contains the sequence "ff" or "fi" and maybe other sequences. For
>> example, here is a latex code and the html code generated by ht4tex and htlatex. Note how
>> the sequence "fi" was translated to fi
>
> Could you please tell me why you think this is a bug? Please keep the
> following in mind.
>
> 1. TeX4HT tries as much as possible to be *like* TeX except that it
>    outputs hypertext.
>
> 2. TeX uses ligatures whenever it encounters ff, fi, fl and so on.
>
> 3. It *is* possible for you to define an alternate mechanism to avoid
>    ligatures---create your own htf files which skip the ligatures.

It was not me who originally sent the bug, but I do agree that TeX4HT shouldn't put ligatures in its output.

Main argument: ligatures are by their nature not appropriate for HTML and the other outputs of TeX4HT. Ligatures were invented to better represent groups of letters on _paper_. TeX also uses kerning (adjusting the space between letters) for the same purpose, and TeX4HT omits kerning in its output.

The HTML document format is not designed to contain excessive formatting information: formatting decisions (breaking paragraphs into lines, often the choice of font) are left to the browser. Kerning and ligatures also fall into this category, since they depend on the choice of font.

TeX4HT also puts ligatures in DocBook output. DocBook is designed for the structural content of a document, not its formatting. No one puts ligatures in a DocBook document manually.

As to your 3 points above:

To point 1: Since TeX4HT outputs hypertext, it should differ from TeX (which is designed for paper output) whenever the different nature of the output justifies it.

To point 2: Just what I have said above: the different kind of output justifies omitting ligatures in TeX4HT while TeX uses them. If TeX4HT omits kerning but keeps ligatures, this requires further explanation.

Gabor Braun
Bug#307647: tex4ht: unicode used when it is not needed
Dear Professor Gurari,

Thank you very much. It works like a charm.

Rani

On 5/8/05, Eitan Gurari <[EMAIL PROTECTED]> wrote:
> I modified the bugfixes distribution to provide reduced usage of
> unicode values in iso-8859-1 output. The requests are to be made
> through commands similar to
>
>   htlatex file "" "iso8859/1/charset/less/!"
>
> or by modifying the charset paths in tex4ht.env accordingly.
> Currently the only cases addressed are the ligatures 'ff' and 'fi' and
> a few non-ligature values. Additional cases will be addressed in
> response to bug reports.
>
> -eitan
>
> > tex4ht makes use of unicode letter when this is not needed. This happens when
> > the latex code contains the sequence "ff" or "fi" and maybe other sequences. For
> > example, here is a latex code and the html code generated by ht4tex and htlatex. Note how
> > the sequence "fi" was translated to fi
Bug#307647: tex4ht: unicode used when it is not needed
Dear Eitan,

Thanks for your work on this bug.

A couple of things will slow down the incorporation of these changes into Debian packages.

1. I am currently in transit so I won't really be able to download the changed files right away.

2. Debian is currently in "freeze" pending a release so only "critical" bug-fixes are being accepted.

With best regards,

Kapil.
Bug#307647: tex4ht: unicode used when it is not needed
Change of mind :-(

I modified the default setting to ask for a restricted usage of unicode values. The extended support can be obtained with commands of the form

  htlatex file "" "iso8859/1/charset/uni/!"

or by introducing a corresponding charset path in tex4ht.env.

-eitan

> I modified the bugfixes distribution to provide reduced usage of
> unicode values in iso-8859-1 output. The requests are to be made
> through commands similar to
>
>   htlatex file "" "iso8859/1/charset/less/!"
>
> or by modifying the charset paths in tex4ht.env accordingly.
> Currently the only cases addressed are the ligatures 'ff' and 'fi' and
> a few non-ligature values. Additional cases will be addressed in
> response to bug reports.
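The htlatex invocations quoted in this message pass an empty second argument as a placeholder, which is easy to lose when the command is assembled by a script. The following Python sketch only builds the argument list shown in the thread; the function name `htlatex_command` is hypothetical, and the only charset variants assumed are the ones named in this thread ("less", "uni", and "nolig").

```python
import shlex


def htlatex_command(tex_file, charset_variant):
    """Build the argv list for the htlatex calls quoted in this thread.

    The empty string is deliberate: it keeps the charset path in the
    same argument position as in the quoted commands.
    """
    charset_path = f"iso8859/1/charset/{charset_variant}/!"
    return ["htlatex", tex_file, "", charset_path]


if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch works
    # even on a machine without tex4ht installed.
    for variant in ("less", "uni"):
        print(shlex.join(htlatex_command("file", variant)))
```

Passing the list to `subprocess.run` (rather than joining it into one string) would preserve the empty argument without any shell quoting concerns.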
Bug#307647: tex4ht: unicode used when it is not needed
I modified the bugfixes distribution to provide reduced usage of unicode values in iso-8859-1 output. The requests are to be made through commands similar to

  htlatex file "" "iso8859/1/charset/less/!"

or by modifying the charset paths in tex4ht.env accordingly. Currently the only cases addressed are the ligatures 'ff' and 'fi' and a few non-ligature values. Additional cases will be addressed in response to bug reports.

-eitan

> tex4ht makes use of unicode letter when this is not needed. This happens when
> the latex code contains the sequence "ff" or "fi" and maybe other sequences. For
> example, here is a latex code and the html code generated by ht4tex and htlatex. Note how
> the sequence "fi" was translated to fi
Bug#307647: tex4ht: unicode used when it is not needed
Kapil,

I agree the ligatures shouldn't be represented by bitmaps. To deal with just those cases, a nolig file can be a copy of

  ht-fonts/iso8859/1/charset/unicode.4hf

stored at

  ht-fonts/iso8859/1/charset/nolig/unicode.4hf

augmented with entries similar to

  'fi' '' 'fi' ''
  'fl' '' 'fl' ''

For such a case, a compilation can be requested with a command similar to

  htlatex file "" "iso8859/1/charset/nolig/!"

or the tex4ht.env file should have its charset directory path modified accordingly.

TeX doesn't see the htf fonts--only the postprocessor tex4ht.s deals with them. The tex4ht system does, however, require considerable resources from TeX. The TeX system I run provides the following resources.

  17537 strings out of 61437
  369958 string characters out of 4947194
  2144172 words of memory out of 801
  20492 multiletter control sequences out of 1+65535
  8669 words of font info for 31 fonts, out of 100 for 1000
  14 hyphenation exceptions out of 1000
  36i,8n,28p,231b,2972s stack positions out of 15000i,4000n,6000p,20b,4s

-eitan

> I wasn't thinking of making bitmap fonts for the ligatures. I
> understood the requirement as being roughly "why not use ascii text
> in places where ascii text could suffice for conveying the content".
>
> So I was thinking of just using 'ff' , 'fi' and so on in place
> of 'ff' and so on in a font file heirarchy called "nolig". This
> directory heirarchy would "break" ligatures for all the latin
> characters.
>
> It may also be possible to ask TeX to avoid ligatures during its run.
>
> Another possibility is to check whether (X)HTML allows for "ALT" tags
> or some CSS statement which permits font/glyph substitution.

> P.S. While trying to create the font files using the source I noticed
> that one needs the environment variable "extra_mem_top" to be set to
> about "10" or so in order for TeX to run successfully with the htf
> source files. Is this how you run it?
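The nolig recipe described in this message (copy unicode.4hf into a nolig subdirectory, then append ligature-breaking entries) can be sketched in Python as follows. This is a sketch under assumptions, not a tested tex4ht procedure: the function name `build_nolig_charset` is hypothetical, and the entry syntax in `ASSUMED_NOLIG_ENTRIES` is a guess, because the archive has flattened the first field of each entry to plain letters; it is assumed here to have been a reference to the ligature code point. The thread also does not say whether the position of the entries in the file matters, so appending at the end is likewise an assumption.

```python
from pathlib import Path

# ASSUMPTION: reconstructed entry syntax. The archived message shows the
# first field as plain letters, which is presumably extraction damage.
ASSUMED_NOLIG_ENTRIES = [
    "'&#xFB00;' '' 'ff' ''",
    "'&#xFB01;' '' 'fi' ''",
    "'&#xFB02;' '' 'fl' ''",
    "'&#xFB03;' '' 'ffi' ''",
    "'&#xFB04;' '' 'ffl' ''",
]


def build_nolig_charset(charset_dir, entries=ASSUMED_NOLIG_ENTRIES, name="nolig"):
    """Copy <charset_dir>/unicode.4hf to <charset_dir>/<name>/unicode.4hf
    and append the given entries. Returns the path of the new file."""
    charset_dir = Path(charset_dir)
    data = (charset_dir / "unicode.4hf").read_bytes()
    if data and not data.endswith(b"\n"):
        data += b"\n"  # make sure appended entries start on a new line
    data += "".join(entry + "\n" for entry in entries).encode("ascii")

    target_dir = charset_dir / name
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "unicode.4hf"
    target.write_bytes(data)
    return target
```

For the layout named in the message, the call would be `build_nolig_charset("ht-fonts/iso8859/1/charset")`, after which the `iso8859/1/charset/nolig/!` path from the quoted htlatex command would refer to the new directory.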
Bug#307647: tex4ht: unicode used when it is not needed
Dear Eitan,

On Thu, May 05, 2005 at 11:34:56PM -0400, Eitan Gurari wrote:
> Some background information regarding the problem.

Thanks for this info.

> The unicode.4hf mapping currently doesn't allow creation of bitmap
> fonts. For that to happen the tex4ht.c code needs to be modified to
> provide enhanced support for unicode.4hf files.

I wasn't thinking of making bitmap fonts for the ligatures. I understood the requirement as being roughly "why not use ascii text in places where ascii text could suffice for conveying the content".

So I was thinking of just using 'ff' , 'fi' and so on in place of 'ff' and so on in a font file hierarchy called "nolig". This directory hierarchy would "break" ligatures for all the latin characters.

It may also be possible to ask TeX to avoid ligatures during its run.

Another possibility is to check whether (X)HTML allows for "ALT" tags or some CSS statement which permits font/glyph substitution.

Regards,

Kapil.

P.S. While trying to create the font files using the source I noticed that one needs the environment variable "extra_mem_top" to be set to about "10" or so in order for TeX to run successfully with the htf source files. Is this how you run it?
Bug#307647: tex4ht: unicode used when it is not needed
Kapil,

Some background information regarding the problem.

In the `old days', tex4ht provided, for a given (la)tex font, different htf fonts addressing different character sets. For instance, the (la)tex cmr family of fonts had htf fonts under the unicode and iso-8859-1 branches. In the iso branch quite a few characters got bitmap representations due to lack of native support in the iso character set.

About half a year ago I started deleting the non-unicode htf fonts and providing unicode.4hf translation files instead. When tex4ht.c fails to find an htf font for a character set, it internally creates such a font from the unicode version using the appropriate unicode.4hf mapping. For instance, the iso-8859-1 version of cmr.htf is created from the unicode version of cmr.htf through the mapping provided in the iso-8859-1 version of unicode.4hf.

The unicode.4hf mapping currently doesn't allow creation of bitmap fonts. For that to happen the tex4ht.c code needs to be modified to provide enhanced support for unicode.4hf files.

-eitan

> Eitan may soon provide the possibility of latin fonts as an option
> thereby causing this problem to disappear. I am trying my hand at a
> solution as well.
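The fallback described in this message, where a charset-specific font is derived from the unicode htf font through a unicode.4hf mapping, can be pictured with a toy model in Python. Everything here is illustrative: the dictionaries and the function name `derive_font` are hypothetical, and none of this reflects the actual data structures or file syntax used by tex4ht.c. The slot numbers 11 to 15 are the positions of the f-ligatures in the classic Computer Modern text encoding.

```python
# Toy stand-in for the unicode version of an htf font:
# font slot -> symbol emitted for that slot.
CMR_UNICODE = {
    11: "\ufb00",  # ff ligature
    12: "\ufb01",  # fi ligature
    13: "\ufb02",  # fl ligature
    14: "\ufb03",  # ffi ligature
    15: "\ufb04",  # ffl ligature
    65: "A",
}

# Toy stand-in for a charset-specific unicode.4hf translation file:
# unicode symbol -> replacement to use in that charset.
ISO_8859_1_MAPPING = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def derive_font(unicode_font, mapping):
    """Derive a charset-specific font table from the unicode one.

    Symbols listed in the mapping are replaced; all others pass through.
    """
    return {slot: mapping.get(symbol, symbol) for slot, symbol in unicode_font.items()}


iso_font = derive_font(CMR_UNICODE, ISO_8859_1_MAPPING)
```

In this model, adding or removing entries in the mapping changes the derived font without touching the unicode font itself, which is the property the thread relies on when it proposes a separate "nolig" mapping.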
Bug#307647: tex4ht: unicode used when it is not needed
Dear Rani,

On Thu, May 05, 2005 at 09:06:08AM +0300, Ran Gilad-Bachrach wrote:
> My main goal in using tex4ht is to share documents with people who
> do not use TeX, or process the documents by other programs. For this
> purpose, the problem I have reported on is "important" as it prevents
> such use. However, for the sake of publishing a document in html
> format, this is of no major concern. Thus, I accept your opinion that
> this should be counted in the wish list.

Consequent to your e-mail I examined this a little further---so I am re-evaluating.

Until the world switches over to unicode ...

I think it should certainly be possible to choose to use a set of fonts that does not use unicode for latin characters. So I am upgrading the bug to "normal".

Eitan may soon provide the possibility of latin fonts as an option thereby causing this problem to disappear. I am trying my hand at a solution as well.

Thanks and regards,

Kapil.
Bug#307647: tex4ht: unicode used when it is not needed
Dear Kapil,

My main goal in using tex4ht is to share documents with people who do not use TeX, or process the documents by other programs. For this purpose, the problem I have reported on is "important" as it prevents such use. However, for the sake of publishing a document in html format, this is of no major concern. Thus, I accept your opinion that this should be counted in the wish list.

Thank you for the great assistance,

Rani

On 5/5/05, Kapil Hari Paranjape <[EMAIL PROTECTED]> wrote:
> Dear Ran Gilad-Bachrach,
>
> Please see the enclosed mail from the author Eitan Gurari.
> He is planning to provide a fix in the next version. For
> the time being I think I will agree with Vassilii that this is
> really "wishlist" rather than "important" (at least as a bug for
> tex4ht---I do think it is up to text viewers/browsers that
> render unicode to do this job as correctly as possible).
>
> On Wed, May 04, 2005 at 03:04:48PM -0400, Eitan Gurari wrote:
> > Unfortunately, too many people complain about this and other similar
> > lack of font support problems by browsers for unicode symbols. I'll
> > try to `fix' the problem the coming weekend. -eitan
>
> Perhaps the fix will take the form of an option for mk4ht/htlatex that
> selects non-unicode glyph substitution.
>
> I hope I have your permission. I am re-tagging this as a wishlist item.
>
> Thanks and regards,
>
> Kapil.
Bug#307647: tex4ht: unicode used when it is not needed
Dear Ran Gilad-Bachrach,

Please see the enclosed mail from the author Eitan Gurari. He is planning to provide a fix in the next version. For the time being I think I will agree with Vassilii that this is really "wishlist" rather than "important" (at least as a bug for tex4ht---I do think it is up to text viewers/browsers that render unicode to do this job as correctly as possible).

On Wed, May 04, 2005 at 03:04:48PM -0400, Eitan Gurari wrote:
> Unfortunately, too many people complain about this and other similar
> lack of font support problems by browsers for unicode symbols. I'll
> try to `fix' the problem the coming weekend. -eitan

Perhaps the fix will take the form of an option for mk4ht/htlatex that selects non-unicode glyph substitution.

I hope I have your permission. I am re-tagging this as a wishlist item.

Thanks and regards,

Kapil.
Bug#307647: tex4ht: unicode used when it is not needed
> however I have noticed two things which makes the conversion tex4ht
> does problematic. First, when you open the html file using a browser,
> the ff,fi,... combination look different than the rest of the text
> (blurred).

This looks like a browser bug to me, if a unicode char is generated for the ligature and the browser shows it differently from the surrounding chars. In general, a run of text within the same font should look the same with respect to font weight etc. It could also be a font problem: maybe your default browser font doesn't have the ligature chars hinted correctly while the other chars are, and you're using a scaled font.

> Second, the funny conversion makes it hard to apply
> post-processors, such as spell checkers and syntax checkers to the
> html file.

Why don't you use a LaTeX-aware spell checker, like ispell, on the TeX source? As for the HTML syntax checking (i.e., validation of the SGML markup), I doubt it should matter whether the character entities for the ligatures are in or not.

> you are probably right that by creating an htf file this can be
> done. However, I would expect that this would be the default behavior,
> hence I do not think that the user should be bothered with doing that.
> Nevertheless, I might be wrong ...

Just a note from another user of tex4ht who thinks this is definitely not a bug, but rather a wishlist item. Personally, I can't identify with the need to implement this feature, but it could be that other users want it as well.

V.
Bug#307647: tex4ht: unicode used when it is not needed
Unfortunately, too many people complain about this and other similar lack of font support problems by browsers for unicode symbols. I'll try to `fix' the problem the coming weekend.

-eitan

> > tex4ht makes use of unicode letter when this is not needed. This happens when
> > the latex code contains the sequence "ff" or "fi" and maybe other sequences. For
> > example, here is a latex code and the html code generated by ht4tex and htlatex. Note how
> > the sequence "fi" was translated to fi
>
> Could you please tell me why you think this is a bug? Please keep the
> following in mind.
>
> 1. TeX4HT tries as much as possible to be *like* TeX except that it
>    outputs hypertext.
>
> 2. TeX uses ligatures whenever it encounters ff, fi, fl and so on.
>
> 3. It *is* possible for you to define an alternate mechanism to avoid
>    ligatures---create your own htf files which skip the ligatures.
>
> Thanks and best regards,
>
> Kapil.
Bug#307647: tex4ht: unicode used when it is not needed
Dear Kapil,

Thank you for the prompt answer. I am not an expert in typesetting; however, I have noticed two things which make the conversion tex4ht does problematic. First, when you open the html file in a browser, the ff, fi, ... combinations look different from the rest of the text (blurred). Second, the funny conversion makes it hard to apply post-processors, such as spell checkers and syntax checkers, to the html file.

You are probably right that this can be done by creating an htf file. However, I would expect this to be the default behavior, hence I do not think that the user should be bothered with doing that. Nevertheless, I might be wrong ...

Thank you once again,

Rani

On 5/4/05, Kapil Hari Paranjape <[EMAIL PROTECTED]> wrote:
> Dear Ran Gilad-Bachrach,
>
> Thanks for your report.
>
> On Wed, May 04, 2005 at 03:58:36PM +0300, Ran Gilad-Bachrach wrote:
> > tex4ht makes use of unicode letter when this is not needed. This happens when
> > the latex code contains the sequence "ff" or "fi" and maybe other sequences. For
> > example, here is a latex code and the html code generated by ht4tex and htlatex. Note how
> > the sequence "fi" was translated to fi
>
> Could you please tell me why you think this is a bug? Please keep the
> following in mind.
>
> 1. TeX4HT tries as much as possible to be *like* TeX except that it
>    outputs hypertext.
>
> 2. TeX uses ligatures whenever it encounters ff, fi, fl and so on.
>
> 3. It *is* possible for you to define an alternate mechanism to avoid
>    ligatures---create your own htf files which skip the ligatures.
>
> Thanks and best regards,
>
> Kapil.
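The post-processing difficulty raised in this message can also be worked around outside tex4ht, by expanding the Latin ligature code points back to plain letters before the HTML reaches a spell checker. The Python sketch below is a generic text filter, not part of tex4ht. It assumes the ligatures appear either as literal characters U+FB00 to U+FB04 or as numeric character references to those code points; the function name `expand_ligatures` is hypothetical.

```python
import re

# Latin f-ligatures from the Unicode "Alphabetic Presentation Forms"
# block, with their plain-letter spellings.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Matches numeric character references such as &#xFB01; or &#64257;
_CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def expand_ligatures(html):
    """Replace f-ligatures (literal or as numeric references) with ASCII.

    All other characters and references are left untouched.
    """
    def replace_ref(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        try:
            char = chr(code)
        except (ValueError, OverflowError):
            return match.group(0)  # not a valid code point: keep as-is
        return LIGATURES.get(char, match.group(0))

    html = _CHAR_REF.sub(replace_ref, html)
    return html.translate(str.maketrans(LIGATURES))
```

An explicit table is used instead of a general Unicode compatibility normalization so that unrelated characters in the document are not rewritten as a side effect.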
Bug#307647: tex4ht: unicode used when it is not needed
Dear Ran Gilad-Bachrach,

Thanks for your report.

On Wed, May 04, 2005 at 03:58:36PM +0300, Ran Gilad-Bachrach wrote:
> tex4ht makes use of unicode letter when this is not needed. This happens when
> the latex code contains the sequence "ff" or "fi" and maybe other sequences. For
> example, here is a latex code and the html code generated by ht4tex and htlatex. Note how
> the sequence "fi" was translated to fi

Could you please tell me why you think this is a bug? Please keep the following in mind.

1. TeX4HT tries as much as possible to be *like* TeX except that it outputs hypertext.

2. TeX uses ligatures whenever it encounters ff, fi, fl and so on.

3. It *is* possible for you to define an alternate mechanism to avoid ligatures---create your own htf files which skip the ligatures.

Thanks and best regards,

Kapil.
Bug#307647: tex4ht: unicode used when it is not needed
Package: tex4ht
Version: 20050402.1817-1
Severity: important

tex4ht makes use of unicode letters when this is not needed. This happens when the latex code contains the sequence "ff" or "fi" and maybe other sequences. For example, here is some latex code and the html code generated by tex4ht and htlatex. Note how the sequence "fi" was translated to fi

--- newfile1.tex ---
%% LyX 1.3 created this file. For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass[english]{article}
\usepackage[latin1]{inputenc}
\makeatletter
\usepackage{babel}
\makeatother
\begin{document}
efficient classifier
\end{document}

--- newfile1.html (tex4ht) ---
efficient classifier

--- newfile1.html (htlatex) ---
efficient classifier

-- System Information:
Debian Release: 3.1
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: i386 (i686)
Kernel: Linux 2.6.8
Locale: LANG=he_IL, LC_CTYPE=he_IL (charmap=ISO-8859-8)

Versions of packages tex4ht depends on:
ii  libc6         2.3.2.ds1-20  GNU C Library: Shared libraries an
ii  libkpathsea3  2.0.2-28      path search library for teTeX (run
ii  tetex-bin     2.0.2-28      The teTeX binary files

-- no debconf information