Hello Stuart,

On 10/12/2009, at 2:50 AM, Stuart Rossiter wrote:

Hi,

  This revisits issues raised (but not resolved) in a 2003 post:
http://tug.org/mailman/htdig/latex2html/2003-August/002400.html

It appears that latex2html is (still) converting em- and en-dashes to
-- and - respectively. Since hyphens are also left as -, there is then
no way to distinguish (in the HTML) between things that were en-dashes
and normal hyphens (so you can't do the conversions to &ndash; etc.
manually, even if you want to).

Also, the main script has do_cmd_textemdash and do_cmd_textendash
routines (to convert to --- and -- respectively), but these don't seem
to get used when you explicitly use \textemdash and \textendash
commands, which I thought would be a way round this problem (it still
does the conversions to -- and -).

No, that is not entirely correct.
The coding has:

# these can be overridden in charset (.pl) extension files:
sub do_cmd_textemdash { join('','---', $_[0]);}
sub do_cmd_textendash { join('','--', $_[0]);}

So if you set the charset then you can get other results.

Alternatively, you can override these in a configuration file,
as that gets read after the main script has been loaded.
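
As a minimal sketch of such an override: the lines below could go in an
init file (for example $HOME/.latex2html-init, or a file named with the
-init_file switch). The numeric character references are an assumption
about the desired output, not something taken from the main script, and
this has not been checked against any particular LaTeX2HTML version.

```perl
# Sketch only: redefine the two subs quoted above so that the explicit
# \textemdash and \textendash commands produce distinct characters.
# ASSUMPTION: numeric character references are the wanted output;
# substitute whatever replacement strings suit your pages.
sub do_cmd_textemdash { join('', '&#8212;', $_[0]); }   # U+2014 em-dash
sub do_cmd_textendash { join('', '&#8211;', $_[0]); }   # U+2013 en-dash

1;  # an init file is Perl code and must end with a true value
```

Note that this only changes what the explicit commands produce.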



So it appears that:

-- latex2html can't distinguish these dashes properly (I assume that,
as for quotes, this is an issue with being able to definitively
identify them), although it's distinguishing *something* in doing the
conversions to -- and - ! (so maybe this *can* be fixed?)

It is also a matter of output encodings.

By default, LaTeX2HTML was written to produce Latin 1 output,
that is, ISO-8859-1 encoding.
This does not include single characters for endash and emdash.

If you want single characters, and HTML coding that validates,
then you must either use entities, or expand the charset, or both.
There are switches  -unicode  and  -entities  for this.

With the  -unicode  switch you should get  &#8211;  and  &#8212;
respectively, for  --  and  ---  within normal paragraphs.

With switches  -unicode -entities  then these characters
are supposed to be translated into named entities:
    &ndash;  and   &mdash;

Or with switches   -unicode -utf8   then you should get
the correct single characters in UTF8 encoding.



-- there is also no way to "preserve" the dashes from the original in
a way which would allow for accurate manual adjustments afterwards.

This statement is true when you do not specify  -unicode .
It is not true when you do include this switch.
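
Once the dashes stay distinct in the HTML, adjustments afterwards become
a simple substitution. The following is a minimal sketch of one such
adjustment, assuming the output carries the dashes as the decimal
references &#8211; and &#8212;. The helper name name_dash_entities
is hypothetical and not part of LaTeX2HTML.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LaTeX2HTML): rewrite the decimal
# character references for en- and em-dash as named entities.
# ASSUMPTION: the HTML contains exactly "&#8211;" and "&#8212;".
my %DASH_NAME = (8211 => 'ndash', 8212 => 'mdash');

sub name_dash_entities {
    my ($html) = @_;
    $html =~ s/&#(8211|8212);/&$DASH_NAME{$1};/g;
    return $html;
}

# Plain hyphens are left untouched, so they remain distinguishable.
print name_dash_entities("pages 3&#8211;5 &#8212; a well-known note\n");
# prints: pages 3&ndash;5 &mdash; a well-known note
```

The same substitution could be run over each generated .html file.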

LaTeX2HTML was written at a time when browser support for Unicode
was very flaky indeed. That is why the defaults are what they are.
Since then web technologies have advanced considerably, and other
tools do quite a good job of translating LaTeX coding into HTML,
or XHTML or XML.

On the other hand, customising LaTeX2HTML is not that hard,
 **provided** you can use Perl, and have a good understanding
of just what it is that you really want to do.



Am I missing something, or is there any advice people can offer?


Hopefully the above helps.


Thanks in advance,
Stuart


Cheers,

        Ross

------------------------------------------------------------------------
Ross Moore                                       [email protected]
Mathematics Department                           office: E7A-419
Macquarie University                             tel: +61 (0)2 9850 8955
Sydney, Australia  2109                          fax: +61 (0)2 9850 8114
------------------------------------------------------------------------



_______________________________________________
latex2html mailing list
[email protected]
http://tug.org/mailman/listinfo/latex2html
