Bug#307647: tex4ht: unicode used when it is not needed

2005-06-05 Thread Braun Gabor
On Thursday 02 June 2005 02.16, Eitan Gurari wrote:
> In its default configuration, TeX4ht tries to take a middle road,
> with the objective of satisfying a large assortment of users, uses, and
> tools.  It is difficult to argue for one proper behavior that will be
> acceptable to all.  On the other hand, tex4ht is highly configurable

I respect that your situation is not easy.  Probably, high and easy 
configurability is the only solution.

Best wishes,

 Gabor Braun


-- 
To UNSUBSCRIBE, email to [EMAIL PROTECTED]
with a subject of unsubscribe. Trouble? Contact [EMAIL PROTECTED]



Bug#307647: tex4ht: unicode used when it is not needed

2005-06-01 Thread Braun Gabor
> > tex4ht uses Unicode characters where they are not needed. This happens
> > when the LaTeX source contains the sequence ff or fi, and maybe other
> > sequences. For example, here is a LaTeX file and the HTML generated by
> > tex4ht and htlatex. Note how the sequence fi was translated to &#xFB01;
>
> Could you please tell me why you think this is a bug? Please keep the
> following in mind.
>
> 1. TeX4HT tries as much as possible to be *like* TeX except that it
>    outputs hypertext.
>
> 2. TeX uses ligatures whenever it encounters ff, fi, fl and so on.
>
> 3. It *is* possible for you to define an alternate mechanism to avoid
>    ligatures---create your own htf files which skip the ligatures.

I am not the one who originally reported the bug, but I do agree that TeX4HT
shouldn't put ligatures in its output.

Main argument: ligatures are not appropriate for html and other outputs of
TeX4HT by their nature.

Ligatures were invented to represent groups of letters better on _paper_.
TeX also uses kerning (adjusting the space between letters) for the same
purpose, and TeX4HT omits kerning from its output.

The HTML document format is not designed to carry detailed formatting
information: formatting decisions (breaking paragraphs into lines, and often
the choice of font) are left to the browser.  Kerning and ligatures also fall
into this category, since they depend on the choice of font.

TeX4HT also puts ligatures in DocBook output.  DocBook is designed for the
structural content of a document, not its formatting.  No one puts ligatures
in a DocBook document manually.

As to your 3 points above:
To point 1: Since TeX4HT outputs hypertext, it should differ from TeX
(which is designed for paper output) whenever the different nature of output
justifies it.

To point 2: As I said above, the different kind of output justifies omitting
ligatures in TeX4HT even though TeX uses them.
If TeX4HT omits kerning but keeps ligatures, this requires further explanation.

 Gabor Braun





Bug#307647: tex4ht: unicode used when it is not needed

2005-06-01 Thread Eitan Gurari

> tex4ht makes use of unicode letter when this is not needed...

In its default configuration, TeX4ht tries to take a middle road,
with the objective of satisfying a large assortment of users, uses, and
tools.  It is difficult to argue for one proper behavior that will be
acceptable to all.  On the other hand, tex4ht is highly configurable
and variations to the default settings are quite often easy to
achieve.

> 3. It *is* possible for you to define an alternate mechanism to avoid
>    ligatures---create your own htf files which skip the ligatures.

Under the current font schema, introduced half a year ago, it is
trivial to adjust tex4ht to ignore ligatures.  All it takes is adding
the following lines to the unicode.4hf file of the character encoding
in use.

'&#xFB01;' ''  'fi'  ''
'&#xFB02;' ''  'fl'  ''
'&#xFB00;' ''  'ff'  ''
'&#xFB03;' ''  'ffi' ''
'&#xFB04;' ''  'ffl' ''

These entries are currently included in the default setting for the
iso-8859-1 encoding due to font problems at users' browsers.
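
The ligature entries above follow a fixed pattern, so they can also be
generated rather than typed.  What follows is only a minimal sketch in
Python, not part of tex4ht.  It assumes each entry has four quoted fields
with the second and fourth left empty, and that the first field is the full
numeric character reference (written here with a leading "&").  The names
LIGATURES, format_entry and ligature_entries are made up for illustration.

```python
# Sketch: build unicode.4hf-style lines that map ligature code points
# to plain letters. The four-field layout copies the entries quoted in
# this message; the leading "&" in the first field is an assumption.

LIGATURES = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
}


def format_entry(code_point: int, replacement: str) -> str:
    """Return one line: character reference, empty field, replacement, empty field."""
    return f"'&#x{code_point:04X};' ''  '{replacement}' ''"


def ligature_entries() -> list[str]:
    """Return one entry per ligature, ordered by code point."""
    return [format_entry(cp, text) for cp, text in sorted(LIGATURES.items())]


if __name__ == "__main__":
    print("\n".join(ligature_entries()))
```

The thread does not say whether tex4ht expects the entries in a particular
order; the sketch sorts them by code point only for readability.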

-eitan





Bug#307647: tex4ht: unicode used when it is not needed

2005-05-13 Thread Ran Gilad-Bachrach
Dear Professor Gurari,

  Thank you very much. It works like a charm.

Rani





Bug#307647: tex4ht: unicode used when it is not needed

2005-05-11 Thread kapil
Dear Eitan,

Thanks for your work on this bug. A couple of things will slow down
the incorporation of these changes into Debian packages.

1. I am currently in transit so I won't really be able to download
   the changed files right away.

2. Debian is currently in freeze pending a release so only critical
   bug-fixes are being accepted.

With best regards,

Kapil.






Bug#307647: tex4ht: unicode used when it is not needed

2005-05-08 Thread Eitan Gurari


I modified the bugfixes distribution to provide reduced usage of
unicode values in iso-8859-1 output. The requests are to be made
through commands similar to

   htlatex file  iso8859/1/charset/less/!

or by modifying the charset paths in tex4ht.env accordingly.
Currently the only cases addressed are the ligatures 'ff' and 'fi' and
a few non-ligature values.  Additional cases will be addressed in
response to bug reports.

-eitan






Bug#307647: tex4ht: unicode used when it is not needed

2005-05-06 Thread Kapil Hari Paranjape
Dear Eitan,

On Thu, May 05, 2005 at 11:34:56PM -0400, Eitan Gurari wrote:
> Some background information regarding the problem.

Thanks for this info.

> The unicode.4hf mapping currently doesn't allow creation of bitmap
> fonts. For that to happen the  tex4ht.c  code needs to be modified to
> provide enhanced support for unicode.4hf files.

I wasn't thinking of making bitmap fonts for the ligatures. I
understood the requirement as being roughly: why not use ASCII text
in places where ASCII text suffices to convey the content?

So I was thinking of just using 'ff', 'fi' and so on in place
of '&#xFB00;' and so on in a font file hierarchy called nolig. This
directory hierarchy would break ligatures for all the Latin
characters.

It may also be possible to ask TeX to avoid ligatures during its run.

Another possibility is to check whether (X)HTML allows for ALT tags
or some CSS statement which permits font/glyph substitution.

Regards,

Kapil.

P.S. While trying to create the font files using the source I noticed
that one needs the environment variable extra_mem_top to be set to
about 10 or so in order for TeX to run successfully with the htf
source files. Is this how you run it?
--






Bug#307647: tex4ht: unicode used when it is not needed

2005-05-06 Thread Eitan Gurari

Kapil,

I agree the ligatures shouldn't be represented by bitmaps.  To deal
just with those cases a nolig file can be a copy of

ht-fonts/iso8859/1/charset/unicode.4hf

stored at 

ht-fonts/iso8859/1/charset/nolig/unicode.4hf

augmented with entries similar to

'&#xFB01;' ''  'fi' ''
'&#xFB02;' ''  'fl' ''

For such a case, a compilation can be requested with a command similar
to

htlatex file  iso8859/1/charset/nolig/!

or the tex4ht.env file should have its charset directory path modified
accordingly.
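
The copy-and-augment step described above can be scripted.  The following
is a sketch under stated assumptions, not a tested recipe: it assumes the
charset directory contains a unicode.4hf file, that appending lines at the
end of the copy is acceptable to tex4ht (the thread does not say whether
order matters), and that the entries carry a leading "&".  The function
name make_nolig_copy is hypothetical.

```python
import shutil
from pathlib import Path

# Entries to append, in the form quoted in this message
# (the leading "&" is an assumption about the file syntax).
EXTRA_ENTRIES = [
    "'&#xFB01;' ''  'fi' ''",
    "'&#xFB02;' ''  'fl' ''",
]


def make_nolig_copy(charset_dir: Path, extra_entries=EXTRA_ENTRIES) -> Path:
    """Copy charset_dir/unicode.4hf to charset_dir/nolig/unicode.4hf
    and append the extra entries to the copy. The original is untouched."""
    source = charset_dir / "unicode.4hf"
    target_dir = charset_dir / "nolig"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "unicode.4hf"
    shutil.copyfile(source, target)
    with target.open("a", encoding="latin-1") as handle:
        handle.write("\n" + "\n".join(extra_entries) + "\n")
    return target


if __name__ == "__main__":
    # Path taken from this message; adjust to the local ht-fonts tree.
    print(make_nolig_copy(Path("ht-fonts/iso8859/1/charset")))
```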

TeX doesn't see the htf fonts--only the postprocessor tex4ht.c deals
with them.  The tex4ht system, however, requires considerable resources
from TeX.  The TeX system I run provides the following resources.

 17537 strings out of 61437 
 369958 string characters out of 4947194 
 2144172 words of memory out of 801 
 20492 multiletter control sequences out of 1+65535 
 8669 words of font info for 31 fonts, out of 100 for 1000 
 14 hyphenation exceptions out of 1000 
 36i,8n,28p,231b,2972s stack positions out of 15000i,4000n,6000p,20b,4s 

-eitan







Bug#307647: tex4ht: unicode used when it is not needed

2005-05-05 Thread Ran Gilad-Bachrach
Dear Kapil,

  My main goal in using tex4ht is to share documents with people who
do not use TeX, or to process the documents with other programs. For
this purpose, the problem I have reported is important, as it prevents
such use. However, for the sake of publishing a document in HTML
format, it is of no major concern. Thus, I accept your opinion that
this should be counted as a wishlist item.

Thank you for the great assistance, 

   Rani

 




Bug#307647: tex4ht: unicode used when it is not needed

2005-05-05 Thread Kapil Hari Paranjape
Dear Rani,

On Thu, May 05, 2005 at 09:06:08AM +0300, Ran Gilad-Bachrach wrote:
> My main goal in using tex4ht is to share documents with people who
> do not use TeX, or to process the documents with other programs. For
> this purpose, the problem I have reported is important, as it prevents
> such use. However, for the sake of publishing a document in HTML
> format, it is of no major concern. Thus, I accept your opinion that
> this should be counted as a wishlist item.

Consequent to your e-mail I examined this a little further---so I
am re-evaluating. 

Until the world switches over to unicode ... I think it should
certainly be possible to choose to use a set of fonts that does
not use unicode for latin characters.

So I am upgrading the bug to normal.

Eitan may soon provide the possibility of latin fonts as an option
thereby causing this problem to disappear. I am trying my hand at a
solution as well.

Thanks and regards,

Kapil.
--






Bug#307647: tex4ht: unicode used when it is not needed

2005-05-05 Thread Eitan Gurari


Kapil,

Some background information regarding the problem.

In the `old days', tex4ht provided for a given (la)tex font different
htf fonts addressing different character sets.  For instance, the
(la)tex cmr family of fonts had htf fonts under the unicode and
iso-8859-1 branches.  In the iso branch quite a few characters got
bitmap representations due to lack of native support in the iso
character set.

About half a year ago I started deleting the non-unicode htf fonts,
and providing unicode.4hf translation files instead.  When tex4ht.c
fails to find a htf font for a character set it internally creates
such a font from the unicode version using the appropriate unicode.4hf
mapping. For instance, the iso-8859-1 version of cmr.htf is created
from the unicode version of cmr.htf through the mapping provided in
the iso-8859-1 version of unicode.4hf.
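
The derivation described above can be pictured with a small Python model.
It is an illustration only and makes no claim about how tex4ht.c is
written.  It assumes a font is a table from character slots to character
references, that a unicode.4hf line has four quoted fields, and that the
first field is replaced by the third; the names parse_mapping and
derive_font, and the slot numbers in the usage lines, are hypothetical.

```python
import re

# One unicode.4hf-style line: four single-quoted fields separated by
# whitespace. Fields containing quote characters are not handled here.
ENTRY = re.compile(r"'([^']*)'\s+'[^']*'\s+'([^']*)'\s+'[^']*'")


def parse_mapping(lines):
    """Map the first quoted field of each line to the third; skip other lines."""
    mapping = {}
    for line in lines:
        match = ENTRY.match(line.strip())
        if match:
            mapping[match.group(1)] = match.group(2)
    return mapping


def derive_font(unicode_font, mapping):
    """Build a charset-specific table from the unicode table.

    Entries found in the mapping are replaced; all others are kept as-is."""
    return {slot: mapping.get(value, value) for slot, value in unicode_font.items()}


if __name__ == "__main__":
    mapping = parse_mapping(["'&#xFB01;' ''  'fi'  ''"])
    # Illustrative slots only, not read from a real htf file.
    print(derive_font({12: "&#xFB01;", 65: "A"}, mapping))
```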

The unicode.4hf mapping currently doesn't allow creation of bitmap
fonts. For that to happen the  tex4ht.c  code needs to be modified to
provide enhanced support for unicode.4hf files.

-eitan






Bug#307647: tex4ht: unicode used when it is not needed

2005-05-04 Thread Ran Gilad-Bachrach

Package: tex4ht
Version: 20050402.1817-1
Severity: important

tex4ht uses Unicode characters where they are not needed. This happens when
the LaTeX source contains the sequence ff or fi, and maybe other sequences.
For example, here is a LaTeX file and the HTML generated by tex4ht and
htlatex. Note how the sequence fi was translated to &#xFB01;


--- newfile1.tex -

%% LyX 1.3 created this file.  For more info, see http://www.lyx.org/.
%% Do not edit unless you really know what you are doing.
\documentclass[english]{article}
\usepackage[latin1]{inputenc}

\makeatletter
\usepackage{babel}
\makeatother
\begin{document}
efficient classifier
\end{document}

--- newfile1.html (tex4ht) ---

efficient classi&#xFB01;er


--- newfile1.html (htlatex) --

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
  "http://www.w3.org/TR/html4/loose.dtd">
<html >
<head><title></title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="generator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/mn.html)">
<meta name="originator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/mn.html)">
<!-- html -->
<meta name="src" content="newfile1.tex">
<meta name="date" content="2005-04-21 09:30:00">
<link rel="stylesheet" type="text/css" href="newfile1.css">
</head><body>

<!--l. 10--><p class="noindent">efficient classi&#xFB01;er
</body></html>




-- System Information:
Debian Release: 3.1
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: i386 (i686)
Kernel: Linux 2.6.8
Locale: LANG=he_IL, LC_CTYPE=he_IL (charmap=ISO-8859-8)

Versions of packages tex4ht depends on:
ii  libc6         2.3.2.ds1-20  GNU C Library: Shared libraries an
ii  libkpathsea3  2.0.2-28      path search library for teTeX (run
ii  tetex-bin     2.0.2-28      The teTeX binary files

-- no debconf information






Bug#307647: tex4ht: unicode used when it is not needed

2005-05-04 Thread Kapil Hari Paranjape
Dear Ran Gilad-Bachrach,

Thanks for your report.

On Wed, May 04, 2005 at 03:58:36PM +0300, Ran Gilad-Bachrach wrote:
> tex4ht uses Unicode characters where they are not needed. This happens when
> the LaTeX source contains the sequence ff or fi, and maybe other sequences.
> For example, here is a LaTeX file and the HTML generated by tex4ht and
> htlatex. Note how the sequence fi was translated to &#xFB01;

Could you please tell me why you think this is a bug? Please keep the
following in mind.

1. TeX4HT tries as much as possible to be *like* TeX except that it
   outputs hypertext.

2. TeX uses ligatures whenever it encounters ff, fi, fl and so on.

3. It *is* possible for you to define an alternate mechanism to avoid
   ligatures---create your own htf files which skip the ligatures.

Thanks and best regards,

Kapil.
--






Bug#307647: tex4ht: unicode used when it is not needed

2005-05-04 Thread Ran Gilad-Bachrach
Dear Kapil,

  Thank you for the prompt answer. I am not an expert in typesetting;
however, I have noticed two things which make the conversion tex4ht
does problematic. First, when you open the HTML file in a browser,
the ff, fi, ... combinations look different from the rest of the text
(blurred). Second, the funny conversion makes it hard to apply
post-processors, such as spell checkers and syntax checkers, to the
HTML file.
  You are probably right that by creating an htf file this can be
done. However, I would expect that this would be the default behavior,
hence I do not think that the user should be bothered with doing that.
Nevertheless, I might be wrong ...
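
For the post-processing problem, one workaround that needs no change to
tex4ht is to expand the ligature references in the generated HTML before
handing it to a spell checker.  The following Python sketch assumes the
ligatures appear as hexadecimal character references such as "&#xFB01;";
the function name expand_ligature_refs is made up.  It relies on the
Unicode compatibility decomposition (NFKD), under which U+FB01 decomposes
to the two letters "fi".

```python
import re
import unicodedata

# Hexadecimal numeric character references, e.g. "&#xFB01;".
CHAR_REF = re.compile(r"&#[xX]([0-9A-Fa-f]+);")


def _expand(match):
    """Replace a Latin ligature reference (U+FB00..U+FB06) by its letters."""
    code_point = int(match.group(1), 16)
    if 0xFB00 <= code_point <= 0xFB06:
        return unicodedata.normalize("NFKD", chr(code_point))
    return match.group(0)  # leave every other reference untouched


def expand_ligature_refs(markup: str) -> str:
    """Return markup with ligature references expanded to plain letters."""
    return CHAR_REF.sub(_expand, markup)


if __name__ == "__main__":
    print(expand_ligature_refs("efficient classi&#xFB01;er"))
    # prints: efficient classifier
```

Only references in the ligature range are touched, so other entities and
the markup itself pass through unchanged.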

  thank you once again,

   Rani

 




Bug#307647: tex4ht: unicode used when it is not needed

2005-05-04 Thread Eitan Gurari

Unfortunately, too many people complain about this and similar
problems caused by browsers' lack of font support for unicode symbols.
I'll try to `fix' the problem the coming weekend.  -eitan

  
  





Bug#307647: tex4ht: unicode used when it is not needed

2005-05-04 Thread Vassilii Khachaturov
> however, I have noticed two things which make the conversion tex4ht
> does problematic. First, when you open the HTML file in a browser,
> the ff, fi, ... combinations look different from the rest of the text
> (blurred).

This looks like a browser bug to me, if a unicode char is generated for
the ligature, and the browser is showing it differently than the
surrounding chars. In general a run of text within the same font should
look the same wrt the font weight etc. Could also be a font problem,
maybe your default browser font doesn't have the ligature chars hinted
correctly while the other chars are, and you're using a scaled font.

> Second, the funny conversion makes it hard to apply
> post-processors, such as spell checkers and syntax checkers, to the
> HTML file.

Why don't you use a Latex-aware spell checker, like ispell, on the TeX
source? As for the HTML syntax checking (i.e., validation of the SGML
markup), I doubt it should matter whether the character entities for the
ligatures are in or not.

> You are probably right that by creating an htf file this can be
> done. However, I would expect that this would be the default behavior,
> hence I do not think that the user should be bothered with doing that.
> Nevertheless, I might be wrong ...

Just a note from another user of tex4ht who thinks this is definitely not
a bug, but rather a wishlist item. Personally, I can't identify with the need
to implement this feature, but it could be that other users want it as well.

V.






Bug#307647: tex4ht: unicode used when it is not needed

2005-05-04 Thread Kapil Hari Paranjape
Dear Ran Gilad-Bachrach,

Please see the enclosed mail from the author Eitan Gurari.
He is planning to provide a fix in the next version. For
the time being I think I will agree with Vassilii that this is
really wishlist rather than important (at least as a bug for
tex4ht---I do think it is up to text viewers/browsers that
render unicode to do this job as correctly as possible).

On Wed, May 04, 2005 at 03:04:48PM -0400, Eitan Gurari wrote:
> Unfortunately, too many people complain about this and similar
> problems caused by browsers' lack of font support for unicode symbols.
> I'll try to `fix' the problem the coming weekend.  -eitan

Perhaps the fix will take the form of an option for mk4ht/htlatex that
selects non-unicode glyph substitution.

I hope I have your permission. I am re-tagging this as a wishlist item.

Thanks and regards,

Kapil.
--


