Hello Fred, Marcus, and others
At 5:39 PM -0400 4/15/98, Fred L. Drake wrote:
>[EMAIL PROTECTED] writes:
> > features and quite a few style-files hadn't been rewritten. I haven't
> > done anything on it since, but some ideas from l2h-ng have gone into l2h.
> > Unless someone else expends efforts into bringing l2h-ng up-to-date it will
> > probably never stand on its own.
>
> So, it's dead. Too bad; it sounds like the right approach.
No, it's not dead yet.
Marek Rouchal is very interested in exploring this approach.
We discussed it at length at the EuroTeX meeting.
I don't exactly agree with the approach myself, for LaTeX source,
but it may well be correct for a TeX2HTML or TeX2XML converter.
The big problem is the one Marcus identifies:
>another. Major minus: You have to make exactly sure that you handle all the
>nesting levels right. Otherwise things can go very bad.
This is the same problem that besets TeX, when bracketing levels are
wrong or tokens do not match the pattern expected by a macro.
Whole chunks of the input can fail to be interpreted.
Eventually parsing must stop because the structure of the input
cannot be deduced.
Currently LaTeX2HTML sidesteps this kind of problem by:
1. checking the bracketing levels early,
   allowing you to abort if there are messages about unmatched braces;
2. its `inside-out' processing order
   (which really isn't as bad now as it used to be; more below),
   which encapsulates errors within the smallest surrounding environment.
Thus processing need not stop for errors.
They can be reported at the end, and all fixed together;
rather than the infuriating `stop-edit-test' cycle needed with TeX.
In particular, most (if not all) of the image generation is done on
the first run. Subsequent error-correcting runs are generally much faster,
since most of the hard work has already been done.
A further advantage (for the future) of having environments encapsulated
this way is that parallel processing could be implemented effectively:
simply allow separate processors to handle complete environments
simultaneously.
(Some care would be needed for counters; e.g. for equation-numbering.)
> > This was one of the major features of l2h-ng and it would take a big
> > boatload of work to include it in l2h. You see, one of the biggest
> > problems of l2h is that everything is processed "inside-out". That is,
> > the innermost environment is processed first. In l2h-ng processing occurs
>
Concerning "inside-out" processing...
Early versions of LaTeX2HTML called &translate_environments
on the contents, *before* calling &do_env_<env>.
This made it impossible for sub-environments to inherit information
from the parent environment;
e.g. an {enumerate} inside another {enumerate}
could not get its two-level numbering correct; e.g. 3a, 3b, ... .
(Also it is almost impossible to get correctly nested HTML tags
this way, when there are paragraphs inside the inner environments.)
Since v97.1 (perhaps earlier) &translate_environments is no longer
called automatically. Instead, each &do_env_<env> is called first.
These subroutines must call &translate_environments themselves,
at a point where it is appropriate to do so.
(This is like having separate &do_begin_env_... and &do_end_env_...
subroutines, but combined into a single one.)
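To illustrate this calling order, here is a self-contained toy sketch, *not*
actual latex2html code: the handler runs before its body is translated, so a
nested {enumerate} can inherit state (here, a depth counter) from its parent.
The handler body, the HTML mapping, and the greedy one-pair matching are all
invented for this example.

```perl
#!/usr/bin/perl
# Toy model (not latex2html code): the handler for an environment runs
# *before* its body is translated, so inherited state -- here a depth
# counter -- is visible to nested environments.
my $depth = 0;

sub do_env_enumerate {
    my ($body) = @_;
    $depth++;
    my $style = ($depth == 1) ? '1' : 'a';    # 1. 2. ... versus a. b. ...
    # greedy match: adequate only for this single nested pair
    $body =~ s/\\begin\{enumerate\}(.*)\\end\{enumerate\}/do_env_enumerate($1)/se;
    $depth--;
    "<OL TYPE=\"$style\">$body</OL>";
}

my $latex = '\begin{enumerate}\item x \begin{enumerate}\item y \end{enumerate}\end{enumerate}';
$latex =~ s/\\begin\{enumerate\}(.*)\\end\{enumerate\}/do_env_enumerate($1)/se;
print "$latex\n";   # <OL TYPE="1">\item x <OL TYPE="a">\item y </OL></OL>
```

With the old inside-out order the inner handler would have run first, while
$depth was still 0, and both lists would have come out numbered the same way.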
> I was thinking, at this point, of simply having the
>get_next_argument() function knowing whether an environment was being
>processed or not; getting the sequence right is clearly a much bigger
>task. In particular, a global variable could be set to "env" just
>before do_env_*() is called, and to "cmd" just before do_cmd_*() is
>called. This would allow get_next_argument() to "get it right,"
>avoiding a lot of the pain associated argument extraction. The most
>difficult part is finding all the places such an assignment would need
>to be made; less difficult for someone familiar with the structure of
>the sources and Perl syntax than for me.
Currently &do_cmd_*() is fed 2 arguments, though frequently it is OK
to ignore the 2nd one:
  1st argument: string-type
      subsequent text at the current level of environment-nesting
  2nd argument: list-type
      list of currently open HTML tags (thanks Marcus, from l2h-ng)
One way to reduce the memory requirement is to replace the 1st argument
by a `*-reference' rather than the string itself.
Thus the &do_cmd_* routine would typically start:
    sub do_cmd_<cmd> {
        local(*_, @open_tags) = @_;
        ....
Other parts of the subroutine, including the parameter-reading parts,
need not change; *except* that now the return value need only be the
new HTML code constructed by the subroutine, not the whole environment.
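As a self-contained toy demonstration of this calling convention (the command
name and the simplified one-level brace pattern are invented; the real scripts
use $next_pair_rx etc. and must handle nesting):

```perl
#!/usr/bin/perl
# Toy demo of the `*-reference' convention: the caller passes the
# remaining document text as a glob (*text), so the subroutine edits
# it in place, via $_, instead of receiving a copy of the string.
sub do_cmd_emph {
    local(*_, @open_tags) = @_;
    # consume one brace-delimited argument from the shared text
    # (simplified pattern: no nesting, unlike the real $next_pair_rx)
    s/^\{([^{}]*)\}//;
    my $arg = $1;
    "<em>$arg</em>";    # return only the newly generated HTML
}

$text = '{hello} and more';       # package variable: a glob needs one
$html = &do_cmd_emph(*text, 'B');
print "$html\n";    # <em>hello</em>
print "$text\n";    # " and more" -- the argument was consumed in place
```

Note that the string is never copied into the subroutine; the argument is
removed from the shared text itself.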
The problem with this approach is that it is totally incompatible
with subroutines already constructed for the current approach.
*Every* instance must be found and changed, in all LaTeX2HTML's
support files and packages.
Any private user-defined macros will fail, perhaps by duplicating
large strings --- indeed there could easily be infinite looping.
Doing this is the analog of Marcus' *big boatload*...
>This was one of the major features of l2h-ng and it would take a big
>boatload of work to include it in l2h. You see, one of the biggest
> > in standard TeX order. This is done by treating environments exactly like
>...
> > LaTeX. For example, an environment can be opened in one file and closed in
I don't believe that people actually do this sort of thing in LaTeX.
Besides, the current mechanism of slurping all input files avoids
that kind of problem --- at the expense of memory, of course.
> It also means I could do the indexing the way I want to! (Or at
>least the same way I do in LaTeX.) *THAT* is a big issue for me.
That's multiple indexes, yes?
It shouldn't be too difficult to extend the makeidx.perl package
to handle this.
> > another. Major minus: You have to make exactly sure that you handle all the
> > nesting levels right. Otherwise things can go very bad.
>
> But it sounds like this is dealt with entirely by the do_cmd_begin()
>and do_cmd_end() functions; why would this be a problem?
Not really.
The TeX-like way of processing is too fragile a structure.
If every piece of syntax is exactly correct, and all environments
are correctly balanced, etc., then it works fine.
But really, that is only ever true of the final form of the document,
after much hardship in making it that way.
LaTeX2HTML is in many ways *more complicated* than TeX.
Developing new features would be almost impossible without the
encapsulation of environments, as it is done now.
At least that is my view.
Certainly I agree that the internal structure of LaTeX2HTML
processing can be made more efficient, memory-wise.
But I don't think that throwing away the encapsulation of
environments is a necessary part of this.
************************************************
As for reading macro-arguments, which started this thread,
the best way is to use a construction like:
    local($myparam);
    $myparam = &missing_braces unless (
        (s/$next_pair_pr_rx/$myparam=$2;''/e)
        ||(s/$next_pair_rx/$myparam=$2;''/e));
This achieves the following:
A. it handles both *phases* of brace-processing;
B. it gives a warning message if braces were not found
   (but returns the next letter or control-name anyway).
Most of the do_cmd_* subroutines in the main script use this,
or a variant thereof.
You can easily edit this to get the bracket-ID, if needed.
OK, one could define something like:
    sub get_next_argument {
        my ($param, $br_id, $pat);
        $param = &missing_braces unless (
            (s/$next_pair_pr_rx/$br_id=$1;$param=$2;$pat=$&;''/eo)
            ||(s/$next_pair_rx/$br_id=$1;$param=$2;$pat=$&;''/eo));
        ($param, $br_id, $pat)
    }
...to make life a bit easier for writing packages.
Note that each instance would have to dispose of the unwanted info.
It would probably shorten the latex2html scripts by several hundred lines.
;-)
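Here is a self-contained sketch of how such a helper would behave. The two
patterns below are simplified stand-ins: the raw-brace form carries no
bracket-ID, and the invented <#n#>...<#n#> markup merely stands in for
latex2html's internal processed-brace form, whose real regexes are more
elaborate; a warning stands in for &missing_braces.

```perl
#!/usr/bin/perl
# Simplified stand-ins for the two matching phases; $1 is the
# bracket-ID (empty for raw braces), $2 the argument's contents.
$next_pair_rx    = '^\s*()\{([^{}]*)\}';          # raw braces, no ID
$next_pair_pr_rx = '^\s*<#(\d+)#>([^<]*)<#\1#>';  # processed, with ID

sub get_next_argument {
    my ($param, $br_id, $pat);
    warn "missing braces\n" unless (          # stands in for &missing_braces
        (s/$next_pair_pr_rx/$br_id=$1;$param=$2;$pat=$&;''/e)
        ||(s/$next_pair_rx/$br_id=$1;$param=$2;$pat=$&;''/e));
    ($param, $br_id, $pat);
}

$_ = '<#7#>first<#7#> rest';
my ($arg, $id) = &get_next_argument();
print "$arg $id\n";    # first 7

$_ = '{second} more';
($arg) = &get_next_argument();
print "$arg\n";        # second
```

Each caller then simply ignores whichever of the three returned values it
does not need.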
************************************************
As general advice for package-writers...
Do as much as possible using LaTeX's \newcommand and \newenvironment etc.
If necessary, have one definition for LaTeX and another (usually simpler)
definition for LaTeX2HTML, using conditional code:
    %begin{latexonly}
    \newenvironment{myenv}{....}{....}
    %end{latexonly}
    \begin{htmlonly}
    \newenvironment{myenv}{....}{....}
    \end{htmlonly}
Alternatively, put the LaTeX2HTML definitions into a separate file,
called mydefs.tex, say.
Then \usepackage{mydefs} loads mydefs.sty for LaTeX
but loads mydefs.tex with LaTeX2HTML.
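For instance (the file contents and command name are invented for
illustration), the two files can give the same user-level command
different implementations:

```latex
% mydefs.sty --- read by LaTeX via \usepackage{mydefs}
\newcommand{\keyword}[1]{\textbf{\textsf{#1}}}

% mydefs.tex --- read instead by LaTeX2HTML
\newcommand{\keyword}[1]{\texttt{#1}}
```

The document itself then uses \keyword{...} uniformly, and each processor
picks up the definition it can handle best.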
Other loading strategies are possible, using conditional code.
The main advantage of doing this is to use existing implementations
of code for mathematics, HTML-tables, HTML-list constructions, etc.
If you find that some fundamental HTML constructions are not already
adequately implemented...
... only then is it time to do some Perl programming.
******************************************************
Hope this helps,
Ross Moore
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Ross Moore email: [EMAIL PROTECTED]
Mathematics Department phone: +612 9850 8955
Macquarie University fax: +612 9850 8114
Sydney, NSW 2109 Internet:
Australia http://www-math.mpce.mq.edu.au/~ross/
***************************
for the best in (La)TeX-nical typesetting and Web page production
join the TeX Users Group (TUG) --- browse at http://www.tug.org
<[EMAIL PROTECTED]>
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~