Ross Moore writes:
 > No, it's not dead yet.

  Sorry; I was referring only to the implementation and not the
approach.  Approaches never die, they just return in another usenet
thread.  ;-)

 > I don't exactly agree with the approach myself, for LaTeX source,
 > but it may well be correct for a TeX2HTML or TeX2XML converter.

  I'm sure you gave well-considered reasons for this at EuroTeX; is a
synopsis available online for those who weren't there?  Failing that,
a short summary would be appreciated.

Marcus:
 > >another. Major minus: You have to make exactly sure that you handle all the
 > >nesting levels right. Otherwise things can go very bad.

Ross:
 > This is the same problem that besets TeX, when bracketing levels are
 > wrong or tokens do not match the pattern expected by a macro.

  Is this a problem that needs to be dealt with?  Detected, perhaps,
but I don't see a need to work around it.  If LaTeX fails on the
input, I think it's fine for l2h to fail as well.  It's easier to
get appropriate processing done if there's only one processing
model, shared by every tool (latex, l2h, etc.).  That processing
model is a significant aspect of using TeX-based systems, and I'd say
that l2h *should* share it, even though the implementation is
distinct.  This isn't an issue of "purity" so much as a matter of
user expectations.

 > Currently LaTeX2HTML sidesteps this kind of problem by:
 > 
 >  1. checking the bracketing levels early
 >      allowing you to abort if there are messages about unmatched braces;
 > 
 >  2. its `inside-out' processing order
 >         (which really isn't so bad now as it used to be, more below)
 >     which encapsulates errors within the smallest surrounding environment.
 > 
 > Thus processing need not stop for errors.
 > They can be reported at the end, and all fixed together;
 > rather than the infuriating `stop-edit-test' cycle needed with TeX.
 > In particular, most (if not all) of the image generation is done on
 > the first run. Subsequent error-correcting runs are generally much faster,
 > since most of the hard work has already been done.
 > 
 > 
 > A further advantage (for the future) of having environments encapsulated
 > this way, is that parallel-processing could be implemented effectively.
 > Simply allow separate processors to handle complete environments
 > simultaneously.
 > (Some care would be needed for counters; e.g. for equation-numbering.)

  Yes, the side-effects issue is a real kicker.  This is an issue for
tex/latex as well, and another reason to maintain the TeX-based
processing order.  The process being simulated is inherently
sequential; parallelization makes sense only for a subset of latex
documents.

 > Since v97.1 (perhaps earlier)  &translate_environments  is no longer
 > called automatically. Instead each   &do_env_<env>  is called first.
 > These subroutines must call  &translate_environments  themselves,
 > at a point where it is appropriate to do so.

  So I'm supposed to be calling translate_environments() for each
chunk of input data (parameter, content) that I have access to within
a do_env_*() function?  (This is a good reason to have an extension
manual somewhere; figuring this out from the source is *very*
painful!  Other recent threads have pointed this out as well.)
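
  If I read Ross correctly, the expected pattern is something like
the following sketch (the environment and the exact ordering of the
calls are my guesses, not gospel):

------------------------------------------------------------------------
sub do_env_quotation {
    local($_) = @_;
    # translate nested environments first, then remaining commands:
    $_ = &translate_environments($_);
    $_ = &translate_commands($_);
    join('', '<blockquote>', $_, '</blockquote>');
}
------------------------------------------------------------------------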

 >  2nd argument:  list-type
 >      list of currently open HTML tags (thanks Marcus, from l2h-ng)

  I'm not sure the HTML context is that useful.  The latex context
could be interesting, but only if it could be annotated at each
level.

 > One way to reduce the memory requirement is to replace the 1st argument
 > by a  `*-reference' rather than the string itself.
 > Thus the  &do_cmd_*  routine would typically start:
 > 
 > sub do_cmd_<cmd> {
 >     local(*_,@open_tags) = @_ ;
 >     ....

  I have no idea what a `*-reference' is in Perl, but I'd be willing
to learn.  ;-)
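
  From a little digging, it appears that a `*-reference' means a
typeglob: pass *var instead of $var, and  local(*_) = @_  then makes
$_ an alias for the caller's variable rather than a copy.  A toy
demonstration (the names are hypothetical; note that globs only work
with package variables, not my() lexicals):

------------------------------------------------------------------------
sub munge {
    local(*_) = @_;     # $_ is now an alias for the caller's string
    s/foo/bar/;         # ... so this edits it in place, no copy made
    '';
}

$text = "foo baz";      # package variable; my() lexicals have no glob
munge(*text);
print "$text\n";        # prints "bar baz"
------------------------------------------------------------------------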

 > Other parts of the subroutine, including the parameter-reading parts,
 > need not change;  *except* ...
 >  ... now the return value need only be the new HTML code constructed
 > by the subroutine, not the whole environment.

  How would this work?  If I have an environment for which both
starting and ending HTML need to be generated, how does the
processing engine know where to insert the contents of the
environment?  I understand how it can work for do_cmd_*().
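
  For the record, here's how I picture a command handler working
under the proposed scheme (a sketch only, borrowing Ross's calling
convention; the command itself is arbitrary):

------------------------------------------------------------------------
sub do_cmd_textbf {
    local(*_, @open_tags) = @_;   # $_ aliases the caller's text
    my($br_id, $str);
    # consume the {argument} in place:
    s/$next_pair_pr_rx/$br_id=$1;$str=$2;''/eo;
    "<b>$str</b>";   # return only the new HTML; the remainder
                     # stays behind in the caller's string
}
------------------------------------------------------------------------

The insertion point is obvious there: right where the command was.
For an environment, the contents sit between two chunks of new HTML,
and that's the part I don't see.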

 > The problem with this approach is that it is totally incompatible
 > with subroutines already constructed for the current approach.
 > *Every* instance must be found and changed, in all LaTeX2HTML's

  *I* could live with this, but obviously can't speak for others on
this.

 > Any private user-defined macros will fail, perhaps by duplicating
 > large strings --- indeed there could easily be infinite looping.

  I haven't found infinite looping too difficult to achieve as is!
But that may be my lack of Perl experience creeping up on me.

 > I don't believe that people actually do this sort of thing in LaTeX.
 > Besides, the current mechanism of slurping all input files avoids
 > that kind of problem --- at the expense of memory, of course.

  I think you're probably right on this one.  The problem is the
memory consumption.  For small documents memory isn't a problem, but
for large documents it's a huge problem, especially if l2h *thinks* it 
needs images.

 > That's multiple indexes, yes ?
 > It shouldn't be too difficult to extend the  makeidx.perl  package

  I'll look into this.  

 > > > another. Major minus: You have to make exactly sure that you handle all the
 > > > nesting levels right. Otherwise things can go very bad.
 > >
 > >  But it sounds like this is dealt with entirely by the do_cmd_begin()
 > >and do_cmd_end() functions; why would this be a problem?
 > 
 > Not really.
 > The TeX-like way of processing is too fragile a structure.
 > If every piece of syntax is exactly correct, and all environments
 > are correctly balanced, etc.  then it works fine.

  It's certainly true that the syntax is correct by the time you're
ready to process the document.  Since l2h seems to need the .aux
file, the document must already have survived a LaTeX run; is this a
problem?  Even if l2h didn't need the .aux file, I'm not sure this is
a real issue.  I expect tools to be strict; it keeps me honest.

 > Certainly I agree that the internal structure of LaTeX2HTML
 > processing can be made more efficient, memory-wise.
 > But I don't think that throwing away the encapsulation of
 > environments is a necessary part of this.

  I don't think that "encapsulation" of the environments is a bad
thing; what matters is the processing order.  I want everything
before an environment to be processed first, then the "start this
environment" code, then the environment contents, and then the "end
this environment" code.  I'm sure it can be done in Perl (l2h-ng
apparently proved that), and there are several ways it could be
implemented.  But some LaTeX packages rely on the order of side
effects to achieve correct results; it is important to support this
as well as possible.  This does not preclude scarfing up the whole
document and processing it as one big string in memory, nor is it a
problem for the current translation of { } to magic bracket-id
numbers that change between processing phases.
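
  A toy example of the kind of order dependence I mean (the
environment name and counter are hypothetical; the helper calls
follow the convention discussed above):

------------------------------------------------------------------------
$example_no = 0;
sub do_env_example {
    local($_) = @_;
    ++$example_no;        # side effect: only correct in document order
    join('', "<p><b>Example $example_no.</b></p>",
         &translate_environments($_));
}
------------------------------------------------------------------------

Run the environments out of order, or in parallel, and the numbers
land on the wrong examples.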

 > ************************************************
 > 
 > As for reading macro-arguments, which started this thread,
...
 > It would probably shorten the latex2html scripts by several hundred lines.

  Ok, I just wrote these:

------------------------------------------------------------------------
sub next_argument_id{
    # Pull the next {...} group off the front of $_; return its
    # contents and the unique bracket-id the preprocessor assigned.
    my ($param,$br_id);
    $param = missing_braces()
        unless ((s/$next_pair_pr_rx/$br_id=$1;$param=$2;''/eo)
                ||(s/$next_pair_rx/$br_id=$1;$param=$2;''/eo));
    ($param, $br_id);
}

sub next_argument{
    # Same, discarding the bracket-id.
    my ($param,$br_id) = next_argument_id();
    $param;
}

sub next_optional_argument{
    # Pull a leading [...] off the front of $_, if there is one;
    # return its contents, or '' if the optional argument was omitted.
    my($param,$rx) = ('', "^\\s*(\\[([^]]*)\\])?");
    s/$rx/$param=$2 if $1;''/eo;
    $param;
}
------------------------------------------------------------------------

  Using these throughout my code cut it by about 10% (even after the
addition of these lines), measured using wc, and the result is a lot
easier to read.
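
  A typical call site now reads something like this (the command
itself is hypothetical):

------------------------------------------------------------------------
sub do_cmd_seealso {
    local($_) = @_;
    my $label = next_optional_argument();   # '' if no [label] given
    my $url   = next_argument();            # required {url}
    join('', '<a href="', $url, '">', $label || $url, '</a>', $_);
}
------------------------------------------------------------------------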

 > If necessary, have one definition for LaTeX and another (usually simpler)
 > definition for LaTeX2HTML, using conditional code:

  This seems incredibly painful to me, but maybe I'm just a little
weird.  With a separate file there's no question about who processes
what, and it's not that hard to keep the document interfaces aligned.
The processing requirements can be very different.

 > Alternatively, put the LaTeX2HTML definitions into a separate file,
 > called  mydefs.tex , say.
 > Then  \usepackage{mydefs}  loads  mydefs.sty  for LaTeX
 > but loads  mydefs.tex  with LaTeX2HTML .

  Hm.  Does this work for document classes too?  If I have howto.cls
and howto.tex, is howto.tex loaded by l2h when it sees
\documentclass{howto}?  That's not good if howto.tex is, say, a
template for howto-class documents!  It's a minor problem, though,
and renaming the template wouldn't be that big a deal.


  -Fred

--
Fred L. Drake, Jr.
[EMAIL PROTECTED]
Corporation for National Research Initiatives
1895 Preston White Drive    Reston, VA  20191
