On Wed, 2002-10-23 at 13:43, Marco Antoniotti wrote: 
> 
> ... Lots of stuff deleted.
> 
> Have you checked www.edcom.com/~edward?  (Following the link "technical papers")

Indeed I'm familiar with "A Text Processing Language Should be First a
Programming Language": 
http://citeseer.nj.nec.com/119321.html

Something similar is Scheme Scribe:
http://www-sop.inria.fr/mimosa/fp/Scribe/

I've just come across Richard Furuta's "Important papers in the history
of document preparation systems: basic sources":
http://cajun.cs.nott.ac.uk/compsci/epo/papers/volume5/issue1/ep057rf.pdf

Along with a full-text collection of papers here:
http://cajun.cs.nott.ac.uk/compsci/epo/papers/epoddtoc.html

I am of the opinion that documents can be written and generated more
efficiently using a powerful, general programming language. By making
the separation of code and data a core principle, XML and the toolchain
built on it are misguided. My first attempt at a programmable document
format is discussed here: http://macrology.co.nz/
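
To make that concrete, here is a minimal sketch of the idea (the names
*doc* and emit are hypothetical, not from my actual format): the
document is plain Lisp data, so computation and markup share one
syntax, and a backend is just a tree walk.

    ;; A document as plain Lisp data: here the paragraph's text
    ;; is computed at build time, inside the markup itself.
    (defparameter *doc*
      `(section "Results"
         (para "The first five squares are "
               ,(format nil "~{~a~^, ~}"
                        (loop for i from 1 to 5 collect (* i i)))
               ".")))

    ;; A toy backend: walk the tree and emit HTML.
    (defun emit (node)
      (cond ((stringp node) (princ node))
            ((eq (first node) 'section)
             (format t "<h1>~a</h1>~%" (second node))
             (mapc #'emit (cddr node)))
            ((eq (first node) 'para)
             (princ "<p>")
             (mapc #'emit (rest node))
             (format t "</p>~%"))))

(emit *doc*) prints the section; an equivalent LaTeX backend would
simply emit different strings for the same tree.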

Attempting to support XHTML and LaTeX output simultaneously has been
onerous. HTML is low-level, with little logical document markup: simple
constructions like footnotes or separate pages are not even natively
supported. Still, supporting those is easy compared to transforming
complicated TeX constructions such as tables.
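
For what it's worth, here is roughly how a footnote construct can be
lowered onto HTML, which has no native support for it (a sketch only,
with hypothetical names, not my actual code): number each note, emit a
superscript anchor inline, and flush the collected notes at the end of
the page.

    (defvar *footnotes* '())

    ;; Emit the inline marker and remember the note's text.
    (defun emit-footnote (text)
      (push text *footnotes*)
      (format t "<sup><a href=\"#fn~d\">~:*~d</a></sup>"
              (length *footnotes*)))

    ;; Emit all collected notes as a numbered list at page end.
    (defun flush-footnotes ()
      (format t "<ol>~%")
      (loop for text in (reverse *footnotes*)
            for n from 1
            do (format t "<li id=\"fn~d\">~a</li>~%" n text))
      (format t "</ol>~%")
      (setf *footnotes* '()))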

While worrying that I would not have time to robustly support enough
document constructs, I became aware of TeX4ht. It does a remarkable job
of converting TeX/LaTeX to HTML and is distinguished by using TeX
itself to perform the transformation. This is the best way to approach
any (La)TeX --> HTML transformation, because TeX is about the only
program that can robustly process TeX input. TeX may be bug-free, but
it is also a nightmarishly complicated iterative system that is
extremely difficult to reimplement.

I have now decided to rewrite the document format as functional LaTeX.
I'm going to abandon the iterative tricks I introduced to attain
top-level forms and see how far a fully functional approach can take
me.

I'll concentrate on finishing it before I get involved in further
discussions.

If you did check out my site you'll have seen a comment about the
format having full UTF-8 Unicode support and being evaluated by CLISP.
In trying to support CMUCL I currently have to modify that aim. TeX is
already very expressive in pure ASCII, so it's not a big deal (and
while CMUCL does not understand UTF-8, it has no problem processing
8-bit strings). I also understand there is ongoing work on Unicode
support for CMUCL. If it were possible to deal with variable-length
8-bit encoded strings I'd be perfectly happy to stay with 8-bit strings
forever, though internally variable-length characters would raise lots
of difficult issues.
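
For reference, this is what "variable length" means in practice: UTF-8
packs one code point into one to four octets. A sketch of the encoder
(BMP cases only) shows why character positions in such an 8-bit string
can no longer be found by simple arithmetic:

    ;; One code point becomes 1-3 octets here (4 above the BMP,
    ;; omitted), so indexing by character requires a scan.
    (defun utf-8-octets (code)
      (cond ((< code #x80)
             (list code))
            ((< code #x800)
             (list (logior #xC0 (ash code -6))
                   (logior #x80 (logand code #x3F))))
            ((< code #x10000)
             (list (logior #xE0 (ash code -12))
                   (logior #x80 (logand (ash code -6) #x3F))
                   (logior #x80 (logand code #x3F))))))

    ;; (utf-8-octets #xE9) => (195 169), i.e. e-acute is two octets.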

One thing I would bypass (with the greatest of respect) is 16-bit
Unicode characters. That size is already too small to hold the whole of
Unicode. I would move straight to 32-bit characters and so sidestep the
now entrenched mistakes that other languages like Java have made:

"10 Reasons We Need Java 3.0"
http://www.onjava.com/pub/a/onjava/2002/07/31/java3.html

"7. Extend chars to four bytes.

"Whether the char type is primitive or an object, the truth is that
Unicode is not a two-byte character set. This was perhaps not so
important in the last millennium when Unicode characters outside the
basic multilingual plane were just a theoretical possibility. As of
version 3.2, however, Unicode has about 30,000 more characters than can
be squeezed into two bytes. Four-byte characters include many
mathematical and most musical symbols. In the future it's also likely to
encompass fictional scripts like Tolkien's Tengwar and dead languages
like Linear B. Currently, Java tries to work around the problem by using
surrogate pairs, but the acrobatics required to properly handle these is
truly ugly, and already causing major problems for systems like XML
parsers that need to deal with this ugliness.

"Whether Java promotes the char type to an object or not, it needs to
adopt a model in which characters are a full four bytes. If Java does go
to fully object-oriented types, it could still use UTF-16 or UTF-8
internally for chars and strings to save space. Externally, all
characters should be created equal. Using one char to represent most
characters but two chars to represent some is too confusing. You
shouldn't have to be a Unicode expert just to include a little music or
math in your strings."
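
Concretely, the "acrobatics" being complained about look like this (a
sketch): every code point above #xFFFF must be split into two 16-bit
surrogates, and every consumer of the string must know to reassemble
them.

    ;; Split a supplementary-plane code point into a UTF-16
    ;; surrogate pair.  With 32-bit character codes none of
    ;; this machinery exists at all.
    (defun surrogate-pair (code)
      (let ((v (- code #x10000)))
        (values (logior #xD800 (ash v -10))         ; high surrogate
                (logior #xDC00 (logand v #x3FF))))) ; low surrogate

    ;; (surrogate-pair #x1D11E)  ; U+1D11E, a musical clef
    ;; => #xD834, #xDD1E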


I would much rather waste some space now than soon have to support
out-of-range Unicode characters on a system that only handles 16-bit
character codes. I'm not sure what approach is being taken for CMUCL,
but if there is an opportunity to move straight to 32-bit character
codes I would strongly advise taking it.
You would leapfrog not only Java but also CLISP (which likewise uses
16-bit character codes internally) in simplicity of Unicode support:
http://clisp.sourceforge.net/impnotes/characters.html

Well, I'm off to fix and redesign the document format. Thanks,
everyone.

Regards,
Adam

PS: I'm reevaluating my decision to GPL the resulting document
format/CMS (see the bottom of the front page). The extra freedom
provided by CMUCL's very liberal licensing is quite infectious.
