On Wed, 2002-10-23 at 13:43, Marco Antoniotti wrote:
> ... Lots of stuff deleted.
>
> Have you checked www.edcom.com/~edward? (Following the link "technical papers")
Indeed, I'm familiar with "A Text Processing Language Should Be First a Programming Language":

http://citeseer.nj.nec.com/119321.html

Something similar is Scheme Scribe:

http://www-sop.inria.fr/mimosa/fp/Scribe/

I've just come across:

Richard Furuta, "Important papers in the history of document preparation systems: basic sources"
http://cajun.cs.nott.ac.uk/compsci/epo/papers/volume5/issue1/ep057rf.pdf

along with a full-text collection of papers here:

http://cajun.cs.nott.ac.uk/compsci/epo/papers/epoddtoc.html

I am of the opinion that documents can be written and generated more efficiently using a powerful, general programming language. In making the separation of code and data a core principle, XML and the resulting toolchain are misguided.

My first attempt at a programmable document format is discussed here:

http://macrology.co.nz/

Attempting to support XHTML and LaTeX output simultaneously has been onerous. HTML is low level, with little logical document markup; simple constructions like footnotes or separate pages are not even natively supported. But supporting those is easy compared to transforming complicated TeX constructions such as tables.

While worrying that I would not have enough time to robustly support enough document constructs, I became aware of TeX4ht. This product does a remarkable job of converting TeX/LaTeX to HTML, and is distinguished by its use of the TeX program itself to drive the transformation. This is the best way to approach any (La)TeX --> HTML transformation, because TeX is about the only application that can robustly process TeX input. TeX may be bug-free, but it is also a nightmarishly complicated iterative system that is extremely difficult to duplicate.

Right now I'm at the point where I have decided to rewrite the document format as functional LaTeX. I'm going to abandon the iterative tricks I introduced to attain top-level forms, and see how far the fully functional approach can take me. I'll concentrate on finishing it before I get involved in further discussions.

If you did check out my site, you'll have seen a comment about the format having full UTF-8 Unicode support and being evaluated by CLISP. In trying to support CMUCL, I currently have to modify that aim. TeX is already very expressive using pure ASCII, so it's not a big deal (and while CMUCL does not understand UTF-8, it has no problem processing 8-bit strings). I also understand there is ongoing work on Unicode support for CMUCL.

If it were possible to deal with variable-length 8-bit encoded strings, I'd be perfectly happy to stay with 8-bit strings forever, though internally, variable-length characters would raise lots of difficult issues.
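As a sketch of what "variable length" means in practice (purely illustrative -- this is not an existing CMUCL or CLISP function): under UTF-8 one character becomes one to four octets, so character positions in a string no longer line up with octet positions.

;;; Illustrative only -- not CMUCL or CLISP code.
(defun encode-utf-8 (code-point)
  "Return the UTF-8 encoding of CODE-POINT as a list of octets."
  (cond ((< code-point #x80)          ; 1 octet: plain ASCII
         (list code-point))
        ((< code-point #x800)         ; 2 octets
         (list (logior #xC0 (ash code-point -6))
               (logior #x80 (logand code-point #x3F))))
        ((< code-point #x10000)       ; 3 octets
         (list (logior #xE0 (ash code-point -12))
               (logior #x80 (logand (ash code-point -6) #x3F))
               (logior #x80 (logand code-point #x3F))))
        (t                            ; 4 octets: beyond the BMP
         (list (logior #xF0 (ash code-point -18))
               (logior #x80 (logand (ash code-point -12) #x3F))
               (logior #x80 (logand (ash code-point -6) #x3F))
               (logior #x80 (logand code-point #x3F))))))

;; (encode-utf-8 #x41)    => (65)              ; "A": 1 octet
;; (encode-utf-8 #xE9)    => (195 169)         ; e-acute: 2 octets
;; (encode-utf-8 #x1D11E) => (240 157 132 158) ; musical G clef: 4 octets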
One thing I would bypass (with the greatest of respect) is 16-bit Unicode characters. This character size is already too small to cover the whole of Unicode. I would move straight to 32-bit characters and thereby sidestep what are now entrenched mistakes in other languages like Java:

"10 Reasons We Need Java 3.0"
http://www.onjava.com/pub/a/onjava/2002/07/31/java3.html

"7. Extend chars to four bytes.

"Whether the char type is primitive or an object, the truth is that Unicode is not a two-byte character set. This was perhaps not so important in the last millennium when Unicode characters outside the basic multilingual plane were just a theoretical possibility. As of version 3.2, however, Unicode has about 30,000 more characters than can be squeezed into two bytes. Four-byte characters include many mathematical and most musical symbols. In the future it's also likely to encompass fictional scripts like Tolkien's Tengwar and dead languages like Linear B. Currently, Java tries to work around the problem by using surrogate pairs, but the acrobatics required to properly handle these is truly ugly, and already causing major problems for systems like XML parsers that need to deal with this ugliness.

"Whether Java promotes the char type to an object or not, it needs to adopt a model in which characters are a full four bytes. If Java does go to fully object-oriented types, it could still use UTF-16 or UTF-8 internally for chars and strings to save space. Externally, all characters should be created equal. Using one char to represent most characters but two chars to represent some is too confusing. You shouldn't have to be a Unicode expert just to include a little music or math in your strings."

I would much rather waste some space now than shortly have to deal with supporting out-of-range Unicode characters on a system that only supports 16-bit character codes.
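To put the quoted "acrobatics" in concrete terms, here is a small sketch (again purely illustrative; the function names are mine and not part of any implementation) of the bookkeeping that 16-bit characters force on everything that touches a string:

;;; Illustrative only -- not CMUCL or CLISP code.
;;; With 16-bit characters, every code point above #xFFFF must be
;;; split into a surrogate pair, and every consumer must reassemble it.
(defun code-point->surrogates (code-point)
  "Split CODE-POINT (above #xFFFF) into a UTF-16 surrogate pair."
  (let ((v (- code-point #x10000)))
    (values (logior #xD800 (ash v -10))           ; high surrogate
            (logior #xDC00 (logand v #x3FF)))))   ; low surrogate

(defun surrogates->code-point (high low)
  "Reassemble a surrogate pair into the original code point."
  (+ #x10000 (ash (- high #xD800) 10) (- low #xDC00)))

;; (code-point->surrogates #x1D11E)       => #xD834, #xDD1E
;; (surrogates->code-point #xD834 #xDD1E) => #x1D11E
;;
;; With 32-bit characters none of this exists: the musical G clef,
;; U+1D11E, is simply the character whose code is #x1D11E.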
I'm not sure what approach is being taken for CMUCL. If there is an opportunity to move straight to 32-bit character codes in CMUCL, I would strongly advise taking that path. You would leapfrog not only Java but also CLISP (which likewise uses 16-bit character codes internally) in simplicity of Unicode support:

http://clisp.sourceforge.net/impnotes/characters.html

Well, I'm off to fix and redesign the document format. Thanks, everyone.

Regards,
Adam

PS: I'm reevaluating my decision to GPL the resulting document format/CMS (see the bottom of the front page). The extra freedom provided by CMUCL's most liberal licensing is quite infectious.